Transformer
The neural network architecture behind modern AI assistants like ChatGPT and Claude. Introduced in 2017, it processes all words simultaneously using an attention mechanism.
The transformer is a type of neural network architecture introduced in a 2017 research paper titled "Attention Is All You Need." It is the foundation of every major AI assistant you use today — ChatGPT, Claude, Gemini, and Llama are all built on transformer architecture.
What problem transformers solved
Before transformers, language AI was dominated by recurrent neural networks, which processed text one word at a time, in order. This sequential approach had two problems: it was slow, and it struggled with long text because by the time the model reached the end of a paragraph, it had partly "forgotten" the beginning.
Transformers solved both problems by processing all words simultaneously and using a mechanism called attention to determine which words in the input are most relevant to each other.
How attention works
Imagine reading the sentence: "The bank by the river was covered in wildflowers." The word "bank" could mean a financial institution or a riverbank. You know it means riverbank because you pay attention to "river" and "wildflowers" in the same sentence.
Transformers do something similar mathematically. For every word in the input, the attention mechanism calculates a relevance score against every other word. This lets the model understand context, resolve ambiguity, and maintain coherence across long passages.
This attention mechanism is applied across multiple "heads" simultaneously — each head learns to focus on different types of relationships (grammatical structure, meaning, tone, etc.). This multi-head attention is what gives transformers their remarkable ability to understand nuance.
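The relevance-score calculation described above can be sketched in a few lines of NumPy. This is a minimal, single-head version of scaled dot-product self-attention; real transformers additionally use learned projection matrices to produce the queries, keys, and values, and run many heads in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    """Turn raw scores into weights that sum to 1 along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Each output row is a weighted mix of the value vectors V,
    where the weights are relevance scores between words.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # relevance of every word to every other word
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Toy example: 4 "words", each represented as a 3-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))

# Self-attention: queries, keys, and values all come from the same input.
out, weights = attention(X, X, X)
print(weights.sum(axis=1))  # every row of attention weights sums to 1
```

Note that the whole computation is a couple of matrix multiplications over all words at once, which is exactly the parallelism discussed below.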
Why transformers changed everything
Three properties of transformers combined to create the AI revolution:
- Parallelism: Because transformers process all words at once (not one at a time), they can be trained much faster using modern GPU hardware. This made it practical to train on trillions of words.
- Scalability: Transformer performance improves predictably as you add more parameters and training data. This "scaling law" gave researchers confidence that bigger models would be better models.
- Flexibility: The same architecture works for text, code, images, audio, and video. This versatility means one architecture serves dozens of use cases.
The "T" in GPT and BERT
You will see transformer referenced in many AI product names. GPT stands for Generative Pre-trained Transformer. BERT (an earlier model from Google) stands for Bidirectional Encoder Representations from Transformers. The transformer is the shared foundation.
Encoder vs decoder
Transformers come in three main variants:
- Encoder models (like BERT) are designed to understand text — useful for classification, search, and analysis.
- Decoder models (like GPT and Claude) are designed to generate text — useful for writing, conversation, and content creation.
- Encoder-decoder models combine both — useful for translation and summarisation.
Most AI assistants you interact with are decoder-based, which is why they are so good at generating human-like text.
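The encoder/decoder split comes down to a masking choice in the attention step. The sketch below (an illustrative NumPy fragment, not any model's actual code) shows the idea: an encoder lets every word attend in both directions, while a decoder applies a "causal" mask so each word can only attend to the words before it, which is what makes left-to-right generation possible.

```python
import numpy as np

n = 4  # number of words in a toy sequence
scores = np.zeros((n, n))  # placeholder attention scores (all equal)

# Encoder (bidirectional): every word may attend to every other word,
# so the scores would be used as-is.

# Decoder (causal): word i may only attend to words 0..i, so all
# future positions are masked out before the softmax.
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
masked = np.where(causal_mask, -np.inf, scores)

# Softmax row by row: exp(-inf) = 0, so future words get zero weight.
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(weights)
```

With equal scores, the first word attends only to itself, the second splits its attention 50/50 between the first two words, and so on: the lower-triangular pattern of the weight matrix is the signature of a decoder model.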
Why This Matters
The transformer is not just an academic concept — it is the reason AI went from a research curiosity to a tool that every business is now evaluating. Understanding that all major AI assistants share this same foundational architecture helps you recognise that the differences between products (ChatGPT vs Claude vs Gemini) are about training data, fine-tuning, and product design, not fundamentally different technologies. This knowledge prevents vendor lock-in thinking and helps you evaluate AI tools more objectively.
Continue learning in Foundations
This topic is covered in our lesson: How Large Language Models Actually Work