
Attention Mechanism

Last reviewed: April 2026

A technique that lets AI models focus on the most relevant parts of their input when generating each piece of output; it forms the core of the transformer architecture.

The attention mechanism is the breakthrough idea that powers modern AI language models. It allows a model to focus on different parts of its input when producing each part of its output, rather than processing everything with equal weight.

The problem attention solves

Before attention, language models processed text sequentially, one word at a time, left to right. They struggled with long sentences because early words faded from memory by the time the model reached the end. Attention lets the model look at the entire input at once and decide which parts matter most for the current task.

How attention works (simplified)

When processing the sentence "The cat sat on the mat because it was tired," the model needs to figure out that "it" refers to "the cat." With attention, the model computes a relevance score for every pair of words. The word "it" would have a high attention score with "cat" and a low score with "mat." These scores are learned during training.
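The pairwise scoring above can be sketched in a few lines of NumPy. This is a toy, untrained example: the word vectors here are random stand-ins, whereas a real model learns embeddings and projections during training. It shows only the mechanics of turning pairwise similarities into attention weights.

```python
import numpy as np

words = ["The", "cat", "sat", "on", "the", "mat",
         "because", "it", "was", "tired"]

# Hypothetical 4-dimensional embedding per word (random, for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=(len(words), 4))

d = x.shape[1]
scores = x @ x.T / np.sqrt(d)  # a relevance score for every pair of words

# Softmax each row so the weights for a word sum to 1.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Row i now says how much attention word i pays to every other word.
print(weights.shape)  # one row and one column per word
```

With trained embeddings, the row for "it" would put most of its weight on "cat"; with random vectors the weights are meaningless, but the computation is the same.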

Self-attention and multi-head attention

  • Self-attention lets every word in a sequence attend to every other word, capturing relationships regardless of distance
  • Multi-head attention runs several attention calculations in parallel, each learning different types of relationships (grammatical, semantic, positional). This is like reading a sentence from multiple angles simultaneously
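Both ideas can be combined in a minimal sketch: each head projects the input with its own query, key, and value matrices, runs self-attention independently, and the head outputs are concatenated. The projection matrices below are random placeholders; in a real transformer they are learned, which is how each head comes to specialise in a different type of relationship.

```python
import numpy as np

def self_attention(q, k, v):
    # Scaled dot-product attention: every position attends to every other.
    scores = q @ k.T / np.sqrt(k.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

def multi_head(x, n_heads, rng):
    d = x.shape[1]
    head_dim = d // n_heads
    outputs = []
    for _ in range(n_heads):
        # Per-head query/key/value projections (random here, learned in practice).
        wq, wk, wv = (rng.normal(size=(d, head_dim)) for _ in range(3))
        outputs.append(self_attention(x @ wq, x @ wk, x @ wv))
    # Concatenate the heads back into one representation per token.
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))        # 6 tokens, 8-dimensional embeddings
out = multi_head(x, n_heads=2, rng=rng)
print(out.shape)  # (6, 8): same shape in, same shape out
```

Real transformers also apply a final learned projection after concatenation and add residual connections, which this sketch omits.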

Why attention changed everything

The 2017 paper "Attention Is All You Need" introduced the transformer architecture, which replaced sequential processing entirely with attention. This unlocked massive parallelisation during training, allowing models to scale to billions of parameters. Every major language model today, including GPT, Claude, Gemini, and Llama, is built on attention.

Attention beyond text

Attention is not limited to language. Vision transformers apply attention to image patches. Multimodal models use cross-attention to connect text with images or audio. The mechanism has become a universal building block in AI.


Why This Matters

Attention is the reason modern AI can handle long documents, follow complex instructions, and maintain context across lengthy conversations. Understanding it helps you grasp why context windows matter, why longer inputs cost more, and why some tasks require models with larger attention capacity.
