Attention Budget
The practical limit on how much information an AI model can effectively focus on within its context window, where performance degrades as the window fills up.
An attention budget is the practical limit on how much information an AI model can meaningfully attend to at once. While a model's context window might technically accept 100,000 or even 1,000,000 tokens, the model's ability to make effective use of all that information is not uniform: it degrades as the window fills up.
Context window versus attention budget
A context window is a hard technical limit: the maximum number of tokens the model can process in a single interaction. An attention budget is a softer, practical concept: the amount of information within that window that the model can effectively use.
Think of it like reading a 500-page book versus remembering everything in it. You can physically read all 500 pages (context window), but your ability to recall and connect specific details from page 12 with details from page 487 (attention budget) is limited.
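The distinction can be sketched as a toy model. The window size and the 0.4 recall penalty below are illustrative numbers, not measurements from any real model:

```python
CONTEXT_WINDOW = 100_000  # hard limit: the model rejects anything longer

def fits_in_window(num_tokens: int) -> bool:
    """The hard limit: a request either fits in the context window or it does not."""
    return num_tokens <= CONTEXT_WINDOW

def effective_recall(position: float, fill_ratio: float) -> float:
    """Toy model of the soft limit: estimated recall of a fact placed at a
    relative position (0.0 = start, 1.0 = end) when the window is fill_ratio
    full. Edges stay reliable; the middle of a full window suffers most.
    The 0.4 penalty is made up purely for illustration."""
    distance_from_middle = 2.0 * abs(position - 0.5)  # 1.0 at the edges, 0.0 dead centre
    return 1.0 - 0.4 * fill_ratio * (1.0 - distance_from_middle)
```

The point of the sketch: `fits_in_window` is a binary yes/no, while `effective_recall` degrades smoothly with both position and fill level, which is the behaviour the term "attention budget" is trying to capture.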
Why attention budgets matter
Research consistently shows that language models perform worse on information placed in the middle of long contexts, a phenomenon sometimes called "lost in the middle." Information at the beginning and end of the context tends to be processed more effectively.
This has practical implications:
- Document analysis: If you paste a 50-page document and ask a question, the model may miss relevant information that happens to fall in the middle.
- Multi-document tasks: Stuffing multiple long documents into the context is less effective than strategically selecting the most relevant passages.
- System prompts: Very long system prompts consume attention budget that could be used for the actual task.
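Lost-in-the-middle effects are typically probed with a "needle in a haystack" sweep: the same fact is buried at different depths in filler text and the model is quizzed each time. A minimal harness might look like this, where `ask_model` is a placeholder for whatever chat API you use and the probe question is illustrative:

```python
def build_haystack(needle: str, filler_lines: list[str], position: float) -> str:
    """Insert the needle at a relative depth: 0.0 = start, 1.0 = end."""
    idx = round(position * len(filler_lines))
    return "\n".join(filler_lines[:idx] + [needle] + filler_lines[idx:])

def sweep_positions(needle, expected, filler_lines, ask_model,
                    positions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Ask the same question with the needle buried at each depth and
    record whether the expected answer came back."""
    question = "What is the secret code?"  # illustrative probe question
    return {
        pos: expected in ask_model(build_haystack(needle, filler_lines, pos), question)
        for pos in positions
    }
```

If a model's attention budget were uniform, the sweep would return the same result at every depth; in practice, middle positions tend to fail first as the haystack grows.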
Strategies for managing attention budgets
- Prioritise what goes in: Include only the most relevant information. Quality beats quantity.
- Structure matters: Place the most important information at the beginning and end of the context. Use clear headers and formatting.
- Chunking: Break long documents into sections and process them separately, then synthesise the results.
- Retrieval augmented generation (RAG): Instead of dumping everything into the context, use search to retrieve only the relevant passages.
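The chunking and retrieval strategies above can be sketched together. This is a deliberately crude stand-in for real RAG: production systems score chunks with embeddings, whereas here simple word overlap selects the best chunks that fit a word budget (all function names and thresholds are ours, for illustration):

```python
def chunk(text: str, max_words: int = 100) -> list[str]:
    """Split a long document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def score(question: str, passage: str) -> int:
    """Crude relevance signal: count question words that appear in the passage."""
    return len(set(question.lower().split()) & set(passage.lower().split()))

def select_context(question: str, document: str, budget_words: int = 300) -> str:
    """Keep only the highest-scoring chunks that fit the word budget,
    instead of pasting the whole document into the prompt."""
    ranked = sorted(chunk(document), key=lambda c: score(question, c), reverse=True)
    picked, used = [], 0
    for c in ranked:
        n = len(c.split())
        if used + n > budget_words:
            break
        picked.append(c)
        used += n
    return "\n\n".join(picked)
```

The design choice worth noting is the explicit budget: rather than asking "does this fit in the context window?", the selection asks "which passages earn their place?", which is exactly the shift from thinking in context windows to thinking in attention budgets.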
The evolving landscape
Model providers are actively working to improve effective attention over long contexts. Techniques like sparse attention, sliding window attention, and improved positional encodings are extending practical attention budgets. But the gap between theoretical context window size and practical attention budget remains relevant for anyone building AI-powered applications.
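One of the techniques mentioned above, sliding-window attention, is easy to see in mask form: each token attends only to the previous `window` tokens rather than the whole sequence, so attention cost grows linearly with length instead of quadratically. A minimal sketch of the causal mask:

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j.
    Causal sliding window: token i sees only the `window` most
    recent tokens, itself included."""
    return [[i - window < j <= i for j in range(seq_len)]
            for i in range(seq_len)]
```

Each row has at most `window` True entries, which is where the linear (rather than quadratic) cost comes from; real implementations combine such masks with other tricks, but the masking idea is the core of it.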
Why this matters
Understanding attention budgets helps you get better results from AI tools. Simply dumping more information into the context window does not guarantee better answers; strategic information management consistently outperforms the brute-force approach.