Attention Mechanism
A technique that lets AI models focus on the most relevant parts of their input when generating each piece of output, forming the core of the transformer architecture.
The attention mechanism is the breakthrough idea that powers modern AI language models. It allows a model to focus on different parts of its input when producing each part of its output, rather than processing everything with equal weight.
The problem attention solves
Before attention, language models processed text sequentially: one word at a time, left to right. They struggled with long sentences because early words faded from memory by the time the model reached the end. Attention lets the model look at the entire input at once and decide which parts matter most for the current task.
How attention works (simplified)
When processing the sentence "The cat sat on the mat because it was tired," the model needs to figure out that "it" refers to "the cat." With attention, the model assigns a relevance score between every pair of words. The word "it" would have a high attention score with "cat" and a low score with "mat." These scores are learned during training.
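The scoring idea above can be sketched as scaled dot-product attention, the formulation transformers use: each pair of positions gets a score, the scores are normalised with softmax, and the results weight a sum over the input. The vectors below are random toy embeddings, purely for illustration; real models learn these values during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # relevance score for every pair of words
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example: 3 "words", each a 4-dimensional embedding (random, for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = attention(X, X, X)  # self-attention: queries, keys, values all come from X
print(w.shape)               # (3, 3): one attention weight per word pair
```

In the "it was tired" example, a trained model would produce a high weight in the row for "it" at the column for "cat"; here the weights are meaningless because the embeddings are random.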
Self-attention and multi-head attention
- Self-attention lets every word in a sequence attend to every other word, capturing relationships regardless of distance
- Multi-head attention runs several attention calculations in parallel, each learning different types of relationships (grammatical, semantic, positional). This is like reading a sentence from multiple angles simultaneously
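A minimal sketch of the multi-head idea: split the embedding dimension into heads, run the same attention calculation in each head in parallel, then merge. The projection matrices here are random placeholders standing in for learned parameters; real models learn them during training.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Run n_heads attention calculations in parallel, then merge the results."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project the input, then split the feature dimension into heads
    Q = (X @ W_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # one score matrix per head
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (n_heads, seq_len, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (5, 8): same shape as the input
```

Because each head has its own projections, each head can learn to score a different kind of relationship, which is the "multiple angles" described above.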
Why attention changed everything
The 2017 paper "Attention Is All You Need" introduced the transformer architecture, which replaced sequential processing entirely with attention. This unlocked massive parallelisation during training, allowing models to scale to billions of parameters. Every major language model today (GPT, Claude, Gemini, Llama) is built on attention.
Attention beyond text
Attention is not limited to language. Vision transformers apply attention to image patches. Multimodal models use cross-attention to connect text with images or audio. The mechanism has become a universal building block in AI.
Why This Matters
Attention is the reason modern AI can handle long documents, follow complex instructions, and maintain context across lengthy conversations. Understanding it helps you grasp why context windows matter, why longer inputs cost more, and why some tasks require models with longer context windows.