Prompt Compression
Techniques for reducing the token count of prompts while preserving their essential meaning, enabling cost savings and fitting more useful context into limited context windows.
Prompt compression is the practice of reducing the number of tokens in a prompt while preserving its essential meaning and effectiveness. As AI model usage scales, the cost and performance impact of long prompts becomes significant, making compression an increasingly valuable technique.
Why prompt compression matters
Every token in a prompt costs money (API pricing is per-token) and time (more tokens mean slower responses). For an application handling millions of requests, even small reductions in prompt length translate to meaningful cost savings and latency improvements.
Additionally, every token used for instructions or context is a token that cannot be used for the model's response or additional context. With limited context windows, efficiency matters.
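A quick back-of-the-envelope calculation shows how the savings scale. All prices and volumes below are hypothetical illustrations, not real API rates:

```python
# Estimate monthly prompt spend before and after compression.
# PRICE_PER_MILLION_TOKENS and REQUESTS_PER_MONTH are hypothetical figures.
PRICE_PER_MILLION_TOKENS = 3.00     # assumed $3 per 1M input tokens
REQUESTS_PER_MONTH = 10_000_000

def monthly_prompt_cost(tokens_per_prompt: int) -> float:
    """Dollar cost of the prompt tokens alone, per month."""
    total_tokens = tokens_per_prompt * REQUESTS_PER_MONTH
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

before = monthly_prompt_cost(1200)  # $36,000/month at 1200 tokens per prompt
after = monthly_prompt_cost(800)    # $24,000/month at 800 tokens per prompt
# Trimming 400 tokens per prompt saves $12,000/month at this volume.
```

Latency scales the same way: fewer prompt tokens mean less prefill work per request.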
Compression techniques
- Manual optimisation: Rewriting prompts to be more concise without losing meaning. "Please provide a detailed, comprehensive summary of the following document, making sure to cover all the key points" becomes "Summarise this document, covering all key points."
- Instruction consolidation: Combining multiple separate instructions into unified, denser formats. Instead of five separate rules, create one concise paragraph.
- Example pruning: In few-shot prompts, using the minimum number of examples needed. Often 2-3 well-chosen examples work as well as 5-10.
- Structured formats: Using concise structured formats (JSON, YAML, or shorthand notation) instead of verbose natural language for data and instructions.
- LLMLingua and similar tools: Automated prompt compression tools that use a smaller language model to identify and remove tokens that contribute least to the prompt's meaning.
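As a small illustration of the structured-formats idea, the same field requirements can be expressed as compact JSON instead of full sentences. The snippet below uses whitespace-split word count as a crude stand-in for token count; real tokenisers will differ:

```python
import json

# The same output specification, written two ways. The verbose version
# spells everything out in prose; the compact version states it as data.
verbose = ("Please respond with the person's name, then their age as a "
           "number, and finally their city, formatting your answer clearly.")
compact = json.dumps({"fields": ["name", "age", "city"], "format": "json"})

verbose_words = len(verbose.split())  # 20 words
compact_words = len(compact.split())  # 6 whitespace-separated chunks
```

The structured version is not only shorter; it is also less ambiguous for the model to follow.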
Token-level compression
Research has shown that not all tokens in a prompt contribute equally to the model's output. Stop words, redundant phrases, and verbose formatting often add tokens without improving results. Automated tools can identify and remove these low-impact tokens while preserving the high-impact ones.
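A toy sketch of the idea, using a hand-written filler-word list; real tools such as LLMLingua score token importance with a small language model rather than a fixed list:

```python
# Illustrative only: drop common filler words that rarely change a
# prompt's meaning. This is NOT how LLMLingua works internally, just a
# minimal demonstration of removing low-impact tokens.
FILLER = {"please", "kindly", "very", "really", "just", "basically",
          "actually", "the", "a", "an"}

def prune(prompt: str) -> str:
    kept = [w for w in prompt.split()
            if w.lower().strip(",.") not in FILLER]
    return " ".join(kept)

compressed = prune("Please provide a very detailed summary of the following document")
# → "provide detailed summary of following document"
```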
Context compression for RAG
In retrieval-augmented generation, retrieved documents often contain both relevant and irrelevant information. Rather than including entire documents, extracting only the relevant passages (or using a summarisation step to compress retrieved content) can significantly reduce token usage while maintaining answer quality.
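A minimal extractive sketch, scoring sentences by word overlap with the query; a production system would typically use embedding similarity or an LLM summarisation step instead (the function name and scoring here are illustrative assumptions):

```python
# Keep only the sentences of a retrieved document that share vocabulary
# with the query, rather than passing the whole document to the model.
def compress_context(query: str, document: str, top_k: int = 2) -> str:
    q_words = set(query.lower().split())
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    scored = sorted(sentences,
                    key=lambda s: len(q_words & set(s.lower().split())),
                    reverse=True)
    return ". ".join(scored[:top_k]) + "."

doc = ("Paris is the capital of France. The Eiffel Tower is in Paris. "
       "Bananas are yellow.")
context = compress_context("capital of France", doc)
# The off-topic banana sentence is dropped before the prompt is built.
```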
The quality trade-off
Compression always risks losing important nuance. Overly aggressive compression can:
- Remove critical context that the model needs
- Introduce ambiguity that leads to incorrect interpretations
- Strip away examples that the model relies on for task understanding
The key is finding the right balance for each specific application.
Caching as an alternative
Prompt caching, where the AI provider stores processed prompt prefixes and reuses them across requests, achieves similar cost and latency benefits without modifying the prompt content. When available, caching is often preferable to compression because it preserves full prompt quality.
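Caching works best when prompts are structured so the long, stable part comes first. The helper below is a hypothetical, provider-agnostic sketch of that structure, not a real caching API:

```python
# Put the long, unchanging instructions in a shared prefix and the
# per-request content last, so provider-side prefix caching can reuse
# the processed prefix across requests.
STABLE_SYSTEM_PROMPT = (
    "You are a support assistant. Answer concisely and cite policy "
    "sections where relevant."  # imagine many more unchanging lines here
)

def build_prompt(user_message: str) -> list[dict]:
    return [
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},  # cacheable prefix
        {"role": "user", "content": user_message},            # varies per request
    ]

messages = build_prompt("Where is my order?")
```

Interleaving per-request data into the system prompt would break the shared prefix and defeat the cache.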
Why This Matters
At scale, prompt length directly impacts AI costs and performance. Understanding compression techniques helps you optimise AI spending without sacrificing output quality, a practical skill that becomes increasingly important as AI usage grows across your organisation.