Flash Attention
An optimised algorithm for computing the attention mechanism in transformers that dramatically reduces memory usage and speeds up processing.
Flash Attention is an algorithm that computes the attention mechanism in transformer models much more efficiently than the standard approach. Developed by Tri Dao and collaborators at Stanford, it has become a foundational optimisation that makes modern large language models practical to train and run.
The attention bottleneck
The attention mechanism is the core of transformer models. It allows each token to "attend to" every other token, determining which parts of the input are most relevant at each step. However, standard attention requires creating a massive matrix that grows quadratically with sequence length. A 4,000-token input creates a 4,000 × 4,000 attention matrix with 16 million values; a 100,000-token input creates a matrix with 10 billion values. This quickly exceeds available GPU memory.
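The quadratic cost is easiest to see in code. The sketch below is a minimal NumPy illustration of standard single-head attention (the function name and shapes are illustrative, not from any particular library); the intermediate score matrix it materialises is the memory bottleneck:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materialises the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / d ** 0.5                    # shape (n, n): quadratic in sequence length
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
    return weights @ V

n, d = 4_000, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d), dtype=np.float32) for _ in range(3))
out = naive_attention(Q, K, V)
# The intermediate (n, n) score matrix alone holds 16 million float32 values (64 MB);
# at n = 100,000 it would need 10 billion values (40 GB), far beyond GPU memory.
```

Doubling the sequence length quadruples the size of the score matrix, which is why long contexts were impractical before Flash Attention.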
How Flash Attention works
The key insight is about hardware efficiency. GPUs have fast but small on-chip memory (SRAM) and slow but large main memory (HBM). Standard attention moves data between these memory levels repeatedly, creating a bottleneck. Flash Attention restructures the computation to keep data in fast on-chip memory as much as possible.
It processes the attention matrix in tiles (small blocks that fit in SRAM), computing partial results and combining them without ever materialising the full attention matrix in slow memory. The mathematical result is identical to standard attention, but the computation is 2-4x faster and uses far less memory.
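The tiling idea can be sketched in plain NumPy. This is a simplified single-head version of the "online softmax" recurrence at the heart of Flash Attention; the real algorithm runs this loop inside a fused GPU kernel so each tile stays in SRAM, and the function name and block size here are illustrative:

```python
import numpy as np

def tiled_attention(Q, K, V, block=512):
    """Tiled attention: processes K and V in blocks, never forming the full n x n matrix.

    Running row-wise maxima and softmax denominators let earlier partial results
    be rescaled as each new tile of scores is seen (the "online softmax" trick).
    """
    n, d = Q.shape
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of each query row's scores
    row_sum = np.zeros(n)           # running softmax denominator per row
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / d ** 0.5                # (n, block) tile of scores
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)      # rescale previous partial results
        P = np.exp(S - new_max[:, None])       # unnormalised weights for this tile
        row_sum = row_sum * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vj
        row_max = new_max
    return out / row_sum[:, None]              # normalise at the end
```

Only an (n, block) tile of scores exists at any moment, so memory grows linearly with sequence length while the final output matches standard attention exactly (up to floating-point rounding).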
Impact on the AI industry
Flash Attention has been transformative. It enabled the training of models with much longer context windows: extending contexts from 2,000 tokens to 100,000 or even 1 million would be impractical without it. It reduced the cost of both training and running large models. It has been integrated into virtually every major AI framework and is used by nearly all leading model providers.
Versions and evolution
Flash Attention 2 improved on the original with better parallelism and work partitioning, achieving substantially higher hardware utilisation. Flash Attention 3 introduced further optimisations for newer GPU architectures such as NVIDIA's Hopper. Each version makes transformers more efficient without changing the model itself.
Why This Matters
Flash Attention is one of the most impactful engineering innovations in modern AI. It explains how AI providers can offer models with long context windows at affordable prices and why inference costs have dropped dramatically over the past two years.
This topic is covered in our lesson: How Language Models Generate Text