Multi-Head Attention
A mechanism in transformer models that runs multiple attention operations in parallel, allowing the model to focus on different types of relationships between words simultaneously.
Multi-head attention is a core component of the transformer architecture, the foundation of every modern large language model including ChatGPT, Claude, and Gemini. It allows the model to attend to different types of relationships between tokens simultaneously, dramatically improving its ability to understand language.
Single attention, revisited
In standard self-attention, the model computes how much each word in a sentence should attend to every other word. For the sentence "The cat sat on the mat because it was tired," attention helps the model determine that "it" refers to "the cat" rather than "the mat."
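The single-attention computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the matrix sizes, the random weights, and the function names are all hypothetical, and real models add masking, batching, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings.
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # weights[i, j] = how much token i attends to token j; each row sums to 1.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V  # (seq_len, d_k)

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 16))  # 9 tokens, each a 16-dimensional embedding
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (9, 8)
```

Each output row is a weighted blend of the value vectors of all tokens, with the weights determined by query-key similarity.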
However, a single attention operation captures only one type of relationship at a time. In the same sentence, we might care about several different relationships simultaneously:
- Syntactic: "sat" relates to "cat" as its subject
- Referential: "it" refers back to "cat"
- Spatial: "on" connects "sat" to "mat"
- Causal: "because" connects "sat on the mat" to "was tired"
How multi-head attention works
Multi-head attention runs multiple attention operations in parallel, typically 8 to 128 "heads." Each head:
- Projects the input into a lower-dimensional space using its own learned transformation
- Computes attention within that space
- Captures a different type of pattern or relationship
The outputs of all heads are concatenated and projected back to the original dimension. The result is a rich representation that simultaneously captures multiple types of relationships.
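The project/attend/concatenate/project pipeline above can be sketched as follows. This is a minimal NumPy illustration under assumed sizes (d_model = 32, 4 heads); real implementations add batching, masking, and bias terms.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention.

    X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model).
    Each head operates in a subspace of size d_model // n_heads.
    """
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project once, then split the result into per-head subspaces.
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Every head computes attention independently in its own subspace.
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ V  # (n_heads, seq_len, d_head)
    # Concatenate the heads and project back to the model dimension.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(1)
d_model, n_heads = 32, 4
X = rng.normal(size=(6, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (6, 32)
```

Note that the per-head projections are implemented as one big matrix multiply followed by a reshape; this is how most libraries organise the computation in practice.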
What different heads learn
Research analysing trained transformers has found that individual attention heads specialise in different linguistic phenomena:
- Some heads track syntactic structure (subject-verb-object relationships)
- Some heads handle coreference (which pronoun refers to which noun)
- Some heads attend to positional patterns (previous word, next word)
- Some heads capture semantic similarity between distant words
This specialisation emerges naturally from training; it is not programmed in. The model discovers that dividing its attention capacity among multiple specialised heads is more effective than using a single monolithic attention operation.
Computational efficiency
Counterintuitively, multi-head attention is not more expensive than single-head attention. Because each head operates in a lower-dimensional space (the model dimension divided by the number of heads), the total computation is roughly the same: multiple perspectives at essentially no additional cost.
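The cost claim is easy to verify by counting the multiplications in the query-key score computation. The sizes below are hypothetical, chosen only to make the arithmetic concrete:

```python
# Hypothetical sizes: a 512-dimensional model with 8 heads over 128 tokens.
d_model, n_heads, seq_len = 512, 8, 128
d_head = d_model // n_heads  # 64

# Multiplications in the Q @ K^T score matrix:
single_head = seq_len * seq_len * d_model          # one head at full width
multi_head = n_heads * seq_len * seq_len * d_head  # h heads at width d_model/h

print(single_head == multi_head)  # True: the head count cancels out
```

Splitting the dimension across heads exactly offsets the cost of running more of them, which is why the head count is a free design choice rather than a compute trade-off.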
Grouped-query attention (GQA)
Modern models like Llama 2 and Mistral use a variant called grouped-query attention, where multiple query heads share a single set of key-value heads. This significantly reduces memory usage (particularly in the KV cache) with minimal impact on quality.
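The memory saving from GQA comes directly from storing fewer key-value heads in the cache. A back-of-the-envelope sketch, using hypothetical (not any specific model's) sizes:

```python
# Hypothetical configuration: 32 layers, 4096-token context, fp16 cache.
n_layers, seq_len, d_head = 32, 4096, 128
n_query_heads = 32
n_kv_heads = 8       # GQA: 4 query heads share each key-value head
bytes_per_value = 2  # fp16

def kv_cache_bytes(n_kv):
    # Two cached tensors per layer (K and V), each (seq_len, n_kv * d_head).
    return 2 * n_layers * seq_len * n_kv * d_head * bytes_per_value

mha = kv_cache_bytes(n_query_heads)  # standard multi-head attention
gqa = kv_cache_bytes(n_kv_heads)     # grouped-query attention
print(mha // gqa)  # 4: cache shrinks in proportion to the KV-head reduction
```

Because the KV cache often dominates serving memory at long context lengths, this reduction translates directly into larger batch sizes or longer contexts on the same hardware.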
Why this matters
Multi-head attention is the mechanism that gives transformers their remarkable facility with language. It is the reason these models can simultaneously track grammar, meaning, context, and reference across long passages of text, capabilities that eluded earlier architectures. Understanding it also helps explain why transformers are so effective, why context window limits exist, and why some tasks benefit from models with more attention heads or longer context windows.