Prompt Compression
Techniques for reducing the token count of prompts while preserving their essential meaning, enabling cost savings and fitting more useful context into limited context windows.
Prompt compression is the practice of reducing the number of tokens in a prompt while preserving its essential meaning and effectiveness. As AI model usage scales, the cost and performance impact of long prompts becomes significant, making compression an increasingly valuable technique.
Why prompt compression matters
Every token in a prompt costs money (API pricing is per-token) and time (more tokens mean slower responses). For an application handling millions of requests, even small reductions in prompt length translate to meaningful cost savings and latency improvements.
Additionally, every token used for instructions or context is a token that cannot be used for the model's response or additional context. With limited context windows, efficiency matters.
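A quick back-of-the-envelope calculation shows how the savings scale. All prices and volumes below are hypothetical illustrations, not real API rates:

```python
# Estimate monthly prompt spend before and after compression.
# PRICE_PER_MILLION_TOKENS and REQUESTS_PER_MONTH are hypothetical figures.
PRICE_PER_MILLION_TOKENS = 3.00     # assumed $3 per 1M input tokens
REQUESTS_PER_MONTH = 10_000_000

def monthly_prompt_cost(tokens_per_prompt: int) -> float:
    """Dollar cost of the prompt tokens alone, per month."""
    total_tokens = tokens_per_prompt * REQUESTS_PER_MONTH
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

before = monthly_prompt_cost(1200)  # $36,000/month at 1200 tokens per prompt
after = monthly_prompt_cost(800)    # $24,000/month at 800 tokens per prompt
# Trimming 400 tokens per prompt saves $12,000/month at this volume.
```

Latency scales the same way: fewer prompt tokens mean less prefill work per request.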
Compression techniques
- Manual optimisation: Rewriting prompts to be more concise without losing meaning. "Please provide a detailed, comprehensive summary of the following document, making sure to cover all the key points" becomes "Summarise this document, covering all key points."
- Instruction consolidation: Combining multiple separate instructions into unified, denser formats. Instead of five separate rules, create one concise paragraph.
- Example pruning: In few-shot prompts, using the minimum number of examples needed. Often 2-3 well-chosen examples work as well as 5-10.
- Structured formats: Using concise structured formats (JSON, YAML, or shorthand notation) instead of verbose natural language for data and instructions.
- LLMLingua and similar tools: Automated prompt compression tools that use a smaller language model to identify and remove tokens that contribute least to the prompt's meaning.
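As a small illustration of the structured-formats idea, the same field requirements can be expressed as compact JSON instead of full sentences. The snippet below uses whitespace-split word count as a crude stand-in for token count; real tokenisers will differ:

```python
import json

# The same output specification, written two ways. The verbose version
# spells everything out in prose; the compact version states it as data.
verbose = ("Please respond with the person's name, then their age as a "
           "number, and finally their city, formatting your answer clearly.")
compact = json.dumps({"fields": ["name", "age", "city"], "format": "json"})

verbose_words = len(verbose.split())  # 20 words
compact_words = len(compact.split())  # 6 whitespace-separated chunks
```

The structured version is not only shorter; it is also less ambiguous for the model to follow.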
Token-level compression
Research has shown that not all tokens in a prompt contribute equally to the model's output. Stop words, redundant phrases, and verbose formatting often add tokens without improving results. Automated tools can identify and remove these low-impact tokens while preserving the high-impact ones.
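A toy sketch of the idea, using a hand-written filler-word list; real tools such as LLMLingua score token importance with a small language model rather than a fixed list:

```python
# Illustrative only: drop common filler words that rarely change a
# prompt's meaning. This is NOT how LLMLingua works internally, just a
# minimal demonstration of removing low-impact tokens.
FILLER = {"please", "kindly", "very", "really", "just", "basically",
          "actually", "the", "a", "an"}

def prune(prompt: str) -> str:
    kept = [w for w in prompt.split()
            if w.lower().strip(",.") not in FILLER]
    return " ".join(kept)

compressed = prune("Please provide a very detailed summary of the following document")
# → "provide detailed summary of following document"
```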
Context compression for RAG
In retrieval-augmented generation, retrieved documents often contain both relevant and irrelevant information. Rather than including entire documents, extracting only the relevant passages (or using a summarisation step to compress retrieved content) can significantly reduce token usage while maintaining answer quality.
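A minimal extractive sketch, scoring sentences by word overlap with the query; a production system would typically use embedding similarity or an LLM summarisation step instead (the function name and scoring here are illustrative assumptions):

```python
# Keep only the sentences of a retrieved document that share vocabulary
# with the query, rather than passing the whole document to the model.
def compress_context(query: str, document: str, top_k: int = 2) -> str:
    q_words = set(query.lower().split())
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    scored = sorted(sentences,
                    key=lambda s: len(q_words & set(s.lower().split())),
                    reverse=True)
    return ". ".join(scored[:top_k]) + "."

doc = ("Paris is the capital of France. The Eiffel Tower is in Paris. "
       "Bananas are yellow.")
context = compress_context("capital of France", doc)
# The off-topic banana sentence is dropped before the prompt is built.
```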
The quality trade-off
Compression always risks losing important nuance. Overly aggressive compression can:
- Remove critical context that the model needs
- Introduce ambiguity that leads to incorrect interpretations
- Strip away examples that the model relies on for task understanding
The key is finding the right balance for each specific application.
Caching as an alternative
Prompt caching, where the AI provider stores processed prompt prefixes and reuses them across requests, achieves similar cost and latency benefits without modifying the prompt content. When available, caching is often preferable to compression because it preserves full prompt quality.
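Caching works best when prompts are structured so the long, stable part comes first. The helper below is a hypothetical, provider-agnostic sketch of that structure, not a real caching API:

```python
# Put the long, unchanging instructions in a shared prefix and the
# per-request content last, so provider-side prefix caching can reuse
# the processed prefix across requests.
STABLE_SYSTEM_PROMPT = (
    "You are a support assistant. Answer concisely and cite policy "
    "sections where relevant."  # imagine many more unchanging lines here
)

def build_prompt(user_message: str) -> list[dict]:
    return [
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},  # cacheable prefix
        {"role": "user", "content": user_message},            # varies per request
    ]

messages = build_prompt("Where is my order?")
```

Interleaving per-request data into the system prompt would break the shared prefix and defeat the cache.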
Why This Matters
At scale, prompt length directly impacts AI costs and performance. Understanding compression techniques helps you optimise AI spending without sacrificing output quality, a practical skill that becomes increasingly important as AI usage grows across your organisation.