
Attention Mechanism

Last reviewed: April 2026

A technique that lets AI models focus on the most relevant parts of their input when generating each piece of output; it forms the core of the transformer architecture.

The attention mechanism is the breakthrough idea that powers modern AI language models. It allows a model to focus on different parts of its input when producing each part of its output, rather than processing everything with equal weight.

The problem attention solves

Before attention, language models processed text sequentially, one word at a time, left to right. They struggled with long sentences because early words faded from memory by the time the model reached the end. Attention lets the model look at the entire input at once and decide which parts matter most for the current task.

How attention works (simplified)

When processing the sentence "The cat sat on the mat because it was tired," the model needs to figure out that "it" refers to "the cat." With attention, the model computes a relevance score for every pair of words. The word "it" would have a high attention score with "cat" and a low score with "mat." These scores are learned during training.
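The pairwise scoring above can be sketched in a few lines of NumPy. This is a toy, untrained example: the word vectors here are random stand-ins, whereas a real model learns embeddings and projections during training. It shows only the mechanics of turning pairwise similarities into attention weights.

```python
import numpy as np

words = ["The", "cat", "sat", "on", "the", "mat",
         "because", "it", "was", "tired"]

# Hypothetical 4-dimensional embedding per word (random, for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=(len(words), 4))

d = x.shape[1]
scores = x @ x.T / np.sqrt(d)  # a relevance score for every pair of words

# Softmax each row so the weights for a word sum to 1.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Row i now says how much attention word i pays to every other word.
print(weights.shape)  # one row and one column per word
```

With trained embeddings, the row for "it" would put most of its weight on "cat"; with random vectors the weights are meaningless, but the computation is the same.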

Self-attention and multi-head attention

  • Self-attention lets every word in a sequence attend to every other word, capturing relationships regardless of distance
  • Multi-head attention runs several attention calculations in parallel, each learning different types of relationships (grammatical, semantic, positional). This is like reading a sentence from multiple angles simultaneously
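Both ideas can be combined in a minimal sketch: each head projects the input with its own query, key, and value matrices, runs self-attention independently, and the head outputs are concatenated. The projection matrices below are random placeholders; in a real transformer they are learned, which is how each head comes to specialise in a different type of relationship.

```python
import numpy as np

def self_attention(q, k, v):
    # Scaled dot-product attention: every position attends to every other.
    scores = q @ k.T / np.sqrt(k.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

def multi_head(x, n_heads, rng):
    d = x.shape[1]
    head_dim = d // n_heads
    outputs = []
    for _ in range(n_heads):
        # Per-head query/key/value projections (random here, learned in practice).
        wq, wk, wv = (rng.normal(size=(d, head_dim)) for _ in range(3))
        outputs.append(self_attention(x @ wq, x @ wk, x @ wv))
    # Concatenate the heads back into one representation per token.
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))        # 6 tokens, 8-dimensional embeddings
out = multi_head(x, n_heads=2, rng=rng)
print(out.shape)  # (6, 8): same shape in, same shape out
```

Real transformers also apply a final learned projection after concatenation and add residual connections, which this sketch omits.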

Why attention changed everything

The 2017 paper "Attention Is All You Need" introduced the transformer architecture, which replaced sequential processing entirely with attention. This unlocked massive parallelisation during training, allowing models to scale to billions of parameters. Every major language model today, including GPT, Claude, Gemini, and Llama, is built on attention.

Attention beyond text

Attention is not limited to language. Vision transformers apply attention to image patches. Multimodal models use cross-attention to connect text with images or audio. The mechanism has become a universal building block in AI.


Why This Matters

Attention is the reason modern AI can handle long documents, follow complex instructions, and maintain context across lengthy conversations. Understanding it helps you grasp why context windows matter, why longer inputs cost more, and why some tasks require models with larger attention capacity.
