Self-Attention
The mechanism inside transformers that allows each word to consider every other word in the input when determining its meaning and importance.
Self-attention is the core mechanism inside transformer models that allows each element in a sequence to look at and weigh the importance of every other element when computing its representation. It is what enables transformers to understand context, resolve ambiguity, and maintain coherence across long texts.
The basic concept
When you read the sentence "The trophy would not fit in the suitcase because it was too big," you instantly know "it" refers to the trophy (because the trophy is too big to fit). Self-attention gives AI models a similar ability to determine which words are most relevant to each other.
For every word in the input, self-attention computes:
- Query: "What am I looking for?" The information this word needs from other words.
- Key: "What do I have to offer?" The information this word can provide to other words.
- Value: "What is my actual content?" The information that gets passed along if this word is relevant.
The model compares each word's query against every other word's key to determine relevance, then uses those relevance scores to create a weighted combination of all words' values.
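The query-key comparison and weighted value combination described above can be sketched in a few lines of NumPy. This is an illustrative toy (the function name, random projection matrices, and tiny dimensions are invented for the example, not taken from any particular model), but the math is the standard scaled dot-product attention:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence.
    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: projection matrices."""
    Q = X @ Wq  # queries: what each word is looking for
    K = X @ Wk  # keys: what each word has to offer
    V = X @ Wv  # values: each word's actual content
    # Compare every query against every key, scaled by sqrt of key dimension
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns raw scores into relevance weights that sum to 1 per row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted combination of all words' values
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dim embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # one 8-dim contextualised vector per token
```

Note that every token's output depends on every other token's value, which is exactly what lets the representation of "it" absorb information from "trophy".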
Multi-head attention
Rather than computing attention once, transformers use multiple attention "heads" running in parallel. Each head learns to focus on different types of relationships:
- One head might learn grammatical relationships (subject-verb agreement).
- Another might learn semantic relationships (synonyms, antonyms).
- Another might learn positional relationships (nearby words vs distant ones).
The outputs of all heads are combined, giving the model a rich, multi-faceted understanding of each word's context.
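A minimal sketch of the multi-head idea, under the same toy assumptions as before (random per-head projections stand in for learned weights; names and sizes are invented for illustration). Each head runs the same attention computation on its own smaller projection, and the head outputs are concatenated and mixed by an output projection:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """Toy multi-head self-attention: each head attends in its own subspace."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads  # each head works in a smaller subspace
    head_outputs = []
    for _ in range(n_heads):
        # In a real model these projections are learned; here they are random
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_head))
        head_outputs.append(A @ V)
    # Concatenate all heads, then mix them with an output projection
    concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, d_model)
    Wo = rng.normal(size=(d_model, d_model))
    return concat @ Wo

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dim embeddings
out = multi_head_attention(X, n_heads=2, rng=rng)
print(out.shape)
```

Because the heads are independent, each is free to learn a different relevance pattern, which is where the grammatical/semantic/positional specialisation comes from.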
Why self-attention is revolutionary
Before self-attention, models processed text sequentially and struggled with long-range dependencies. A word at the beginning of a long document could lose its connection to a word at the end. Self-attention connects every word to every other word directly, regardless of distance.
Computational cost
The major downside of self-attention is that it scales quadratically with sequence length. Processing twice as many tokens takes four times as much computation. This is why context windows have practical limits and why researchers are actively developing more efficient attention variants.
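The quadratic scaling falls directly out of the score matrix: with n tokens there are n × n query-key pairs to score. A quick back-of-envelope check (illustrative numbers only):

```python
# Each token attends to every token, so the score matrix has n * n entries.
for n in (1_000, 2_000, 4_000):
    print(f"{n} tokens -> {n * n:,} attention scores")

# Doubling the sequence length quadruples the pairwise scores:
assert 2_000 ** 2 == 4 * 1_000 ** 2
```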
Efficient attention variants
- Sparse attention: Only computing attention for a subset of word pairs.
- Flash attention: An optimised implementation that is mathematically equivalent but much faster.
- Linear attention: Approximations that reduce complexity from quadratic to linear.
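As one concrete example of the sparse-attention idea above, a sliding-window pattern restricts each token to its nearby neighbours, shrinking the number of scored pairs from n² to roughly n × window size. This sketch (function name and parameters invented for illustration) builds such a mask:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask for local (sliding-window) sparse attention:
    token i may attend to token j only if |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(6, 1)
print(mask.astype(int))  # 1 where attention is allowed, 0 where it is skipped
```

In practice the disallowed pairs are simply never computed (or their scores are set to negative infinity before the softmax), which is what saves the quadratic cost.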
Why This Matters
Self-attention is the breakthrough that powers every modern AI assistant you use. Understanding it at a conceptual level helps you grasp why transformers are so good at understanding context, why long documents can be challenging, and why context window size is a key model differentiator.
This topic is covered in our lesson: Understanding Model Architectures