Multi-Head Attention
A mechanism in transformer models that runs multiple attention operations in parallel, allowing the model to focus on different types of relationships between words simultaneously.
Multi-head attention is a core component of the transformer architecture, the foundation of every modern large language model including ChatGPT, Claude, and Gemini. It allows the model to attend to different types of relationships between tokens simultaneously, dramatically improving its ability to understand language.
Single attention, revisited
In standard self-attention, the model computes how much each word in a sentence should attend to every other word. For the sentence "The cat sat on the mat because it was tired," attention helps the model determine that "it" refers to "the cat" rather than "the mat."
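The single-attention computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the matrix sizes, the random weights, and the function names are all hypothetical, and real models add masking, batching, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings.
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # weights[i, j] = how much token i attends to token j; each row sums to 1.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V  # (seq_len, d_k)

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 16))  # 9 tokens, each a 16-dimensional embedding
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (9, 8)
```

Each output row is a weighted blend of the value vectors of all tokens, with the weights determined by query-key similarity.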
However, a single attention operation captures only one type of relationship at a time. In the same sentence, we might care about several different relationships simultaneously:
- Syntactic: "sat" relates to "cat" as its subject
- Referential: "it" refers back to "cat"
- Spatial: "on" connects "sat" to "mat"
- Causal: "because" connects "sat on the mat" to "was tired"
How multi-head attention works
Multi-head attention runs multiple attention operations in parallel, typically 8 to 128 "heads." Each head:
- Projects the input into a lower-dimensional space using its own learned transformation
- Computes attention within that space
- Captures a different type of pattern or relationship
The outputs of all heads are concatenated and projected back to the original dimension. The result is a rich representation that simultaneously captures multiple types of relationships.
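The project/attend/concatenate/project pipeline above can be sketched as follows. This is a minimal NumPy illustration under assumed sizes (d_model = 32, 4 heads); real implementations add batching, masking, and bias terms.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention.

    X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model).
    Each head operates in a subspace of size d_model // n_heads.
    """
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project once, then split the result into per-head subspaces.
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Every head computes attention independently in its own subspace.
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ V  # (n_heads, seq_len, d_head)
    # Concatenate the heads and project back to the model dimension.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(1)
d_model, n_heads = 32, 4
X = rng.normal(size=(6, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (6, 32)
```

Note that the per-head projections are implemented as one big matrix multiply followed by a reshape; this is how most libraries organise the computation in practice.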
What different heads learn
Research analysing trained transformers has found that individual attention heads specialise in different linguistic phenomena:
- Some heads track syntactic structure (subject-verb-object relationships)
- Some heads handle coreference (which pronoun refers to which noun)
- Some heads attend to positional patterns (previous word, next word)
- Some heads capture semantic similarity between distant words
This specialisation emerges naturally from training; it is not programmed in. The model discovers that dividing its attention capacity among multiple specialised heads is more effective than using a single monolithic attention operation.
Computational efficiency
Counterintuitively, multi-head attention is not more expensive than single-head attention. Because each head operates in a lower-dimensional space (the model dimension divided by the number of heads), the total computation is roughly the same: multiple perspectives at essentially no additional cost.
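The cost claim is easy to verify by counting the multiplications in the query-key score computation. The sizes below are hypothetical, chosen only to make the arithmetic concrete:

```python
# Hypothetical sizes: a 512-dimensional model with 8 heads over 128 tokens.
d_model, n_heads, seq_len = 512, 8, 128
d_head = d_model // n_heads  # 64

# Multiplications in the Q @ K^T score matrix:
single_head = seq_len * seq_len * d_model          # one head at full width
multi_head = n_heads * seq_len * seq_len * d_head  # h heads at width d_model/h

print(single_head == multi_head)  # True: the head count cancels out
```

Splitting the dimension across heads exactly offsets the cost of running more of them, which is why the head count is a free design choice rather than a compute trade-off.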
Grouped-query attention (GQA)
Modern models like Llama 2 and Mistral use a variant called grouped-query attention, where multiple query heads share a single set of key-value heads. This significantly reduces memory usage (particularly in the KV cache) with minimal impact on quality.
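The memory saving from GQA comes directly from storing fewer key-value heads in the cache. A back-of-the-envelope sketch, using hypothetical (not any specific model's) sizes:

```python
# Hypothetical configuration: 32 layers, 4096-token context, fp16 cache.
n_layers, seq_len, d_head = 32, 4096, 128
n_query_heads = 32
n_kv_heads = 8       # GQA: 4 query heads share each key-value head
bytes_per_value = 2  # fp16

def kv_cache_bytes(n_kv):
    # Two cached tensors per layer (K and V), each (seq_len, n_kv * d_head).
    return 2 * n_layers * seq_len * n_kv * d_head * bytes_per_value

mha = kv_cache_bytes(n_query_heads)  # standard multi-head attention
gqa = kv_cache_bytes(n_kv_heads)     # grouped-query attention
print(mha // gqa)  # 4: cache shrinks in proportion to the KV-head reduction
```

Because the KV cache often dominates serving memory at long context lengths, this reduction translates directly into larger batch sizes or longer contexts on the same hardware.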
Why this matters
Multi-head attention is the mechanism that gives transformers their remarkable facility with language. It is the reason these models can simultaneously track grammar, meaning, context, and reference across long passages of text, capabilities that eluded earlier architectures. Understanding it also helps explain why transformers are so effective, why context window limits exist, and why some tasks benefit from models with more attention heads or longer context windows.