Transformer
The neural network architecture behind modern AI assistants like ChatGPT and Claude. Introduced in 2017, it processes all words simultaneously using an attention mechanism.
The transformer is a type of neural network architecture introduced in a 2017 research paper titled "Attention Is All You Need." It is the foundation of every major AI assistant you use today — ChatGPT, Claude, Gemini, and Llama are all built on transformer architecture.
What problem transformers solved
Before transformers, language AI was dominated by recurrent neural networks, which processed text one word at a time, in order. This sequential approach had two problems: it was slow, and it struggled with long text because by the time the model reached the end of a paragraph, it had partly "forgotten" the beginning.
Transformers solved both problems by processing all words simultaneously and using a mechanism called attention to determine which words in the input are most relevant to each other.
How attention works
Imagine reading the sentence: "The bank by the river was covered in wildflowers." The word "bank" could mean a financial institution or a riverbank. You know it means riverbank because you pay attention to "river" and "wildflowers" in the same sentence.
Transformers do something similar mathematically. For every word in the input, the attention mechanism calculates a relevance score against every other word. This lets the model understand context, resolve ambiguity, and maintain coherence across long passages.
This attention mechanism is applied across multiple "heads" simultaneously — each head learns to focus on different types of relationships (grammatical structure, meaning, tone, etc.). This multi-head attention is what gives transformers their remarkable ability to understand nuance.
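The relevance-score calculation described above can be sketched in a few lines of NumPy. This is a minimal, single-head version of scaled dot-product self-attention; real transformers additionally use learned projection matrices to produce the queries, keys, and values, and run many heads in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    """Turn raw scores into weights that sum to 1 along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Each output row is a weighted mix of the value vectors V,
    where the weights are relevance scores between words.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # relevance of every word to every other word
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Toy example: 4 "words", each represented as a 3-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))

# Self-attention: queries, keys, and values all come from the same input.
out, weights = attention(X, X, X)
print(weights.sum(axis=1))  # every row of attention weights sums to 1
```

Note that the whole computation is a couple of matrix multiplications over all words at once, which is exactly the parallelism discussed below.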
Why transformers changed everything
Three properties of transformers combined to create the AI revolution:
- Parallelism: Because transformers process all words at once (not one at a time), they can be trained much faster using modern GPU hardware. This made it practical to train on trillions of words.
- Scalability: Transformer performance improves predictably as you add more parameters and training data. This "scaling law" gave researchers confidence that bigger models would be better models.
- Flexibility: The same architecture works for text, code, images, audio, and video. This versatility means one architecture serves dozens of use cases.
The "T" in GPT and BERT
You will see transformer referenced in many AI product names. GPT stands for Generative Pre-trained Transformer. BERT (an earlier model from Google) stands for Bidirectional Encoder Representations from Transformers. The transformer is the shared foundation.
Encoder vs decoder
Transformers come in three main variants:
- Encoder models (like BERT) are designed to understand text — useful for classification, search, and analysis.
- Decoder models (like GPT and Claude) are designed to generate text — useful for writing, conversation, and content creation.
- Encoder-decoder models combine both — useful for translation and summarisation.
Most AI assistants you interact with are decoder-based, which is why they are so good at generating human-like text.
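The encoder/decoder split comes down to a masking choice in the attention step. The sketch below (an illustrative NumPy fragment, not any model's actual code) shows the idea: an encoder lets every word attend in both directions, while a decoder applies a "causal" mask so each word can only attend to the words before it, which is what makes left-to-right generation possible.

```python
import numpy as np

n = 4  # number of words in a toy sequence
scores = np.zeros((n, n))  # placeholder attention scores (all equal)

# Encoder (bidirectional): every word may attend to every other word,
# so the scores would be used as-is.

# Decoder (causal): word i may only attend to words 0..i, so all
# future positions are masked out before the softmax.
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
masked = np.where(causal_mask, -np.inf, scores)

# Softmax row by row: exp(-inf) = 0, so future words get zero weight.
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(weights)
```

With equal scores, the first word attends only to itself, the second splits its attention 50/50 between the first two words, and so on: the lower-triangular pattern of the weight matrix is the signature of a decoder model.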
Why This Matters
The transformer is not just an academic concept — it is the reason AI went from a research curiosity to a tool that every business is now evaluating. Understanding that all major AI assistants share this same foundational architecture helps you recognise that the differences between products (ChatGPT vs Claude vs Gemini) are about training data, fine-tuning, and product design, not fundamentally different technologies. This knowledge prevents vendor lock-in thinking and helps you evaluate AI tools more objectively.
Continue learning in Foundations
This topic is covered in our lesson: How Large Language Models Actually Work