Self-Attention
The mechanism inside transformers that allows each word to consider every other word in the input when determining its meaning and importance.
Self-attention is the core mechanism inside transformer models that allows each element in a sequence to look at and weigh the importance of every other element when computing its representation. It is what enables transformers to understand context, resolve ambiguity, and maintain coherence across long texts.
The basic concept
When you read the sentence "The trophy would not fit in the suitcase because it was too big," you instantly know "it" refers to the trophy (because the trophy is too big to fit). Self-attention gives AI models a similar ability to determine which words are most relevant to each other.
For every word in the input, self-attention computes:
- Query: "What am I looking for?" The information this word needs from other words.
- Key: "What do I have to offer?" The information this word can provide to other words.
- Value: "What is my actual content?" The information that gets passed along if this word is relevant.
The model compares each word's query against every other word's key to determine relevance, then uses those relevance scores to create a weighted combination of all words' values.
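The query-key comparison and weighted value combination described above can be sketched in a few lines of NumPy. This is an illustrative toy (the function name, random projection matrices, and tiny dimensions are invented for the example, not taken from any particular model), but the math is the standard scaled dot-product attention:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence.
    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: projection matrices."""
    Q = X @ Wq  # queries: what each word is looking for
    K = X @ Wk  # keys: what each word has to offer
    V = X @ Wv  # values: each word's actual content
    # Compare every query against every key, scaled by sqrt of key dimension
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns raw scores into relevance weights that sum to 1 per row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted combination of all words' values
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dim embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # one 8-dim contextualised vector per token
```

Note that every token's output depends on every other token's value, which is exactly what lets the representation of "it" absorb information from "trophy".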
Multi-head attention
Rather than computing attention once, transformers use multiple attention "heads" running in parallel. Each head learns to focus on different types of relationships:
- One head might learn grammatical relationships (subject-verb agreement).
- Another might learn semantic relationships (synonyms, antonyms).
- Another might learn positional relationships (nearby words vs distant ones).
The outputs of all heads are combined, giving the model a rich, multi-faceted understanding of each word's context.
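A minimal sketch of the multi-head idea, under the same toy assumptions as before (random per-head projections stand in for learned weights; names and sizes are invented for illustration). Each head runs the same attention computation on its own smaller projection, and the head outputs are concatenated and mixed by an output projection:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """Toy multi-head self-attention: each head attends in its own subspace."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads  # each head works in a smaller subspace
    head_outputs = []
    for _ in range(n_heads):
        # In a real model these projections are learned; here they are random
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_head))
        head_outputs.append(A @ V)
    # Concatenate all heads, then mix them with an output projection
    concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, d_model)
    Wo = rng.normal(size=(d_model, d_model))
    return concat @ Wo

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dim embeddings
out = multi_head_attention(X, n_heads=2, rng=rng)
print(out.shape)
```

Because the heads are independent, each is free to learn a different relevance pattern, which is where the grammatical/semantic/positional specialisation comes from.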
Why self-attention is revolutionary
Before self-attention, models processed text sequentially and struggled with long-range dependencies. A word at the beginning of a long document could lose its connection to a word at the end. Self-attention connects every word to every other word directly, regardless of distance.
Computational cost
The major downside of self-attention is that it scales quadratically with sequence length. Processing twice as many tokens takes four times as much computation. This is why context windows have practical limits and why researchers are actively developing more efficient attention variants.
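The quadratic scaling falls directly out of the score matrix: with n tokens there are n × n query-key pairs to score. A quick back-of-envelope check (illustrative numbers only):

```python
# Each token attends to every token, so the score matrix has n * n entries.
for n in (1_000, 2_000, 4_000):
    print(f"{n} tokens -> {n * n:,} attention scores")

# Doubling the sequence length quadruples the pairwise scores:
assert 2_000 ** 2 == 4 * 1_000 ** 2
```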
Efficient attention variants
- Sparse attention: Only computing attention for a subset of word pairs.
- Flash attention: An optimised implementation that is mathematically equivalent but much faster.
- Linear attention: Approximations that reduce complexity from quadratic to linear.
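As one concrete example of the sparse-attention idea above, a sliding-window pattern restricts each token to its nearby neighbours, shrinking the number of scored pairs from n² to roughly n × window size. This sketch (function name and parameters invented for illustration) builds such a mask:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask for local (sliding-window) sparse attention:
    token i may attend to token j only if |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(6, 1)
print(mask.astype(int))  # 1 where attention is allowed, 0 where it is skipped
```

In practice the disallowed pairs are simply never computed (or their scores are set to negative infinity before the softmax), which is what saves the quadratic cost.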
Why This Matters
Self-attention is the breakthrough that powers every modern AI assistant you use. Understanding it at a conceptual level helps you grasp why transformers are so good at understanding context, why long documents can be challenging, and why context window size is a key model differentiator.
This topic is covered in our lesson: Understanding Model Architectures