Flash Attention
An optimised algorithm for computing the attention mechanism in transformers that dramatically reduces memory usage and speeds up processing.
Flash Attention is an algorithm that computes the attention mechanism in transformer models much more efficiently than the standard approach. Developed by Tri Dao and collaborators at Stanford, it has become a foundational optimisation that makes modern large language models practical to train and run.
The attention bottleneck
The attention mechanism is the core of transformer models. It allows each token to "attend to" every other token, determining which parts of the input are most relevant at each step. However, standard attention requires creating a massive matrix that grows quadratically with sequence length. A 4,000-token input creates a 4,000 × 4,000 attention matrix with 16 million values; a 100,000-token input creates a matrix with 10 billion values. This quickly exceeds available GPU memory.
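The quadratic cost is easiest to see in code. The sketch below is a minimal NumPy illustration of standard single-head attention (the function name and shapes are illustrative, not from any particular library); the intermediate score matrix it materialises is the memory bottleneck:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materialises the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / d ** 0.5                    # shape (n, n): quadratic in sequence length
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
    return weights @ V

n, d = 4_000, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d), dtype=np.float32) for _ in range(3))
out = naive_attention(Q, K, V)
# The intermediate (n, n) score matrix alone holds 16 million float32 values (64 MB);
# at n = 100,000 it would need 10 billion values (40 GB), far beyond GPU memory.
```

Doubling the sequence length quadruples the size of the score matrix, which is why long contexts were impractical before Flash Attention.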
How Flash Attention works
The key insight is about hardware efficiency. GPUs have fast but small on-chip memory (SRAM) and slow but large main memory (HBM). Standard attention moves data between these memory levels repeatedly, creating a bottleneck. Flash Attention restructures the computation to keep data in fast on-chip memory as much as possible.
It processes the attention matrix in tiles (small blocks that fit in SRAM), computing partial results and combining them without ever materialising the full attention matrix in slow memory. The mathematical result is identical to standard attention, but the computation is 2-4x faster and uses far less memory.
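The tiling idea can be sketched in plain NumPy. This is a simplified single-head version of the "online softmax" recurrence at the heart of Flash Attention; the real algorithm runs this loop inside a fused GPU kernel so each tile stays in SRAM, and the function name and block size here are illustrative:

```python
import numpy as np

def tiled_attention(Q, K, V, block=512):
    """Tiled attention: processes K and V in blocks, never forming the full n x n matrix.

    Running row-wise maxima and softmax denominators let earlier partial results
    be rescaled as each new tile of scores is seen (the "online softmax" trick).
    """
    n, d = Q.shape
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of each query row's scores
    row_sum = np.zeros(n)           # running softmax denominator per row
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / d ** 0.5                # (n, block) tile of scores
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)      # rescale previous partial results
        P = np.exp(S - new_max[:, None])       # unnormalised weights for this tile
        row_sum = row_sum * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vj
        row_max = new_max
    return out / row_sum[:, None]              # normalise at the end
```

Only an (n, block) tile of scores exists at any moment, so memory grows linearly with sequence length while the final output matches standard attention exactly (up to floating-point rounding).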
Impact on the AI industry
Flash Attention has been transformative. It enabled the training of models with much longer context windows: extending contexts from 2,000 tokens to 100,000 or even 1 million would be impractical without it. It reduced the cost of both training and running large models. It has been integrated into virtually every major AI framework and is used by nearly all leading model providers.
Versions and evolution
Flash Attention 2 improved on the original with better parallelism and work partitioning, achieving substantially higher hardware utilisation. Flash Attention 3 introduced further optimisations for newer GPU architectures such as NVIDIA's Hopper. Each version makes transformers more efficient without changing the model itself.
Why This Matters
Flash Attention is one of the most impactful engineering innovations in modern AI. It explains how AI providers can offer models with long context windows at affordable prices and why inference costs have dropped dramatically over the past two years.
This topic is covered in our lesson: How Language Models Generate Text