
Long-Context Models

Last reviewed: April 2026

AI models designed to process very large inputs — hundreds of thousands or millions of tokens — enabling analysis of entire books, codebases, or document collections in a single prompt.

Long-context models are large language models that can process extremely long inputs — ranging from 100,000 to over 1 million tokens — in a single prompt. This represents a dramatic expansion from earlier models that were limited to a few thousand tokens.

The context window evolution

Early transformer models like GPT-2 had context windows of about 1,000 tokens. GPT-3 roughly doubled that to 2,000, and GPT-3.5 to 4,000. Claude and GPT-4 pushed to 100,000-128,000 tokens. Claude now supports up to 200,000 tokens, and Google's Gemini has demonstrated million-token contexts. This progression has opened entirely new use cases.

What long context enables

  • Whole-document analysis: Process an entire book, legal contract, or financial report in a single prompt instead of chunking it into pieces.
  • Codebase understanding: Feed an entire codebase to the model for debugging, refactoring, or documentation.
  • Multi-document reasoning: Compare multiple documents simultaneously — contracts, reports, research papers.
  • Extended conversations: Maintain coherent conversations over many turns without losing earlier context.
  • Repository-level code generation: Understand project structure and dependencies when writing new code.

Technical challenges

Processing long contexts is computationally expensive. The standard attention mechanism scales quadratically with sequence length — doubling the context length quadruples the computation. Innovations like Flash Attention, sliding window attention, and various sparse attention methods have made long contexts practical.
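The quadratic cost is easy to see with a back-of-envelope calculation. This minimal sketch (the function name and the head dimension of 64 are illustrative assumptions, not from any particular model) counts the multiply-adds needed to form the attention score matrix:

```python
def attention_score_flops(n_tokens: int, d_head: int = 64) -> int:
    """Multiply-adds to compute one head's Q @ K^T score matrix.

    The score matrix has shape (n_tokens, n_tokens), and each entry
    is a dot product of length d_head, so cost grows as n_tokens**2.
    """
    return n_tokens * n_tokens * d_head

base = attention_score_flops(100_000)
doubled = attention_score_flops(200_000)
print(doubled / base)  # 4.0 -- doubling the length quadruples the work
```

This is why naive attention over a million tokens is impractical without the sparse or memory-efficient variants mentioned above.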

Memory is another challenge. Storing the key-value cache for a million tokens requires substantial GPU memory. Techniques like KV-cache compression and quantization help manage this.
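A rough estimate shows the scale of the problem. The hyperparameters below (80 layers, 8 KV heads, head dimension 128, 2 bytes per fp16 element) are illustrative assumptions in the range of large open models, not the specs of any named system:

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 80,
    n_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_elem: int = 2,  # fp16
) -> int:
    """Estimate KV-cache size: K and V tensors (factor of 2) stored
    per layer, each of shape (n_tokens, n_kv_heads, head_dim)."""
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_elem

print(kv_cache_bytes(1_000_000) / 1e9)  # ~328 GB for a million tokens
```

Under these assumptions a million-token cache needs hundreds of gigabytes, which is why KV-cache quantization (fewer bytes per element) and compression pay off so directly.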

The "lost in the middle" problem

Research has shown that models sometimes pay less attention to information in the middle of very long contexts, focusing more on the beginning and end. This means placing critical information at the beginning or end of a long prompt can improve results. Model providers are actively working to improve attention distribution across long contexts.
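The placement advice above can be applied mechanically when assembling a prompt. This is a hypothetical sketch (the function and its layout are an assumption, not a provider-recommended template): instructions go first, bulk context sits in the middle, and the task is restated at the end.

```python
def build_prompt(instructions: str, documents: list[str]) -> str:
    """Sandwich bulk context between the task statement and a
    closing reminder, so critical information sits at the start
    and end of the prompt rather than the middle."""
    middle = "\n\n".join(documents)
    return f"{instructions}\n\n{middle}\n\nReminder of the task: {instructions}"

prompt = build_prompt(
    "Summarize clause 7 of the contract.",
    ["<contract text part 1>", "<contract text part 2>"],
)
```

Restating the task at the end costs a few tokens but keeps the critical instruction in the region the model attends to most reliably.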

Long context vs RAG

Long-context models do not eliminate the need for retrieval-augmented generation. RAG is still more cost-effective when searching across millions of documents. But for tasks involving tens or hundreds of pages, long-context models offer a simpler approach — just include everything in the prompt.


Why This Matters

Long-context models are transforming how professionals interact with large documents and complex information. Understanding their capabilities and limitations helps you design more effective AI workflows and choose the right approach for document-heavy tasks.

