KV Cache (Key-Value Cache)
A memory optimisation technique that stores previously computed attention keys and values during text generation, avoiding redundant computation and significantly speeding up AI model inference.
A KV cache (key-value cache) is a memory optimisation used during text generation by transformer-based AI models. It stores the intermediate computations from the attention mechanism for all previously processed tokens, so they do not need to be recalculated when generating each new token.
Why the KV cache is needed
When a transformer generates text, it produces one token at a time. For each new token, the attention mechanism must attend to all previous tokens in the sequence. Without caching, the model would redo the attention computations for the entire prefix at every step, a cost that grows quadratically with sequence length.
With a KV cache, the key and value tensors from previous steps are stored in memory. When generating the next token, the model only needs to compute new key-value pairs for the latest token and combine them with the cached values.
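The mechanism can be sketched in a few lines. This is a minimal single-head illustration with random projection weights, not any real model's implementation: the point is that only the new token's key and value are computed, while all earlier ones are read back from the cache.

```python
import numpy as np

# Minimal single-head attention with a KV cache. The head dimension and
# projection weights (Wq, Wk, Wv) are illustrative placeholders.
d = 4  # head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def attend(x):
    """Process one new token vector x, reusing cached keys/values."""
    q = x @ Wq
    k_cache.append(x @ Wk)   # only the NEW token's key is computed...
    v_cache.append(x @ Wv)   # ...and its value; the rest come from cache
    K = np.stack(k_cache)    # (t, d): keys for all tokens so far
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()         # softmax over the t cached positions
    return weights @ V               # attention output for the new token

for _ in range(3):
    out = attend(rng.standard_normal(d))

print(len(k_cache))  # 3 cached key vectors, one per token
```

Each call does O(t) work against the cache instead of recomputing keys and values for the whole prefix from scratch.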
The speed improvement
Consider generating a 1,000-token response. Without a KV cache, the model would perform:
- Step 1: process 1 token
- Step 2: process 2 tokens
- …
- Step 1,000: process 1,000 tokens
That is 500,500 token processings in total (the sum 1 + 2 + … + 1,000). With a KV cache, each step processes only the 1 new token plus a lookup into the cache, roughly 1,000 token processings in total, about a 500-fold reduction.
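The arithmetic above can be checked directly:

```python
n = 1_000  # tokens generated

# Without a cache: step t reprocesses all t tokens, so the total is 1 + 2 + ... + n.
no_cache = n * (n + 1) // 2
# With a cache: each step processes only the 1 new token.
with_cache = n

print(no_cache)    # 500500 total token processings
print(with_cache)  # 1000
```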
Memory trade-offs
The KV cache trades memory for speed. For a large model generating a long response, the cache can consume significant GPU memory, sometimes several gigabytes. This creates a practical limit on how many concurrent users a model can serve, because each active conversation requires its own cache.
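A back-of-the-envelope estimate shows where the gigabytes come from. The configuration below is illustrative (roughly a 7B-class model in FP16), not any specific model's published numbers:

```python
# Rough KV cache size estimate. Configuration is an assumption for
# illustration, roughly 7B-class, not a specific model's spec sheet.
layers, kv_heads, head_dim = 32, 32, 128
bytes_per_value = 2  # FP16
seq_len = 4096

# Factor of 2 covers both the key and the value stored per position.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total = per_token * seq_len

print(per_token)      # 524288 bytes, i.e. 512 KiB per token
print(total / 2**30)  # 2.0 GiB for a 4096-token context
```

At half a megabyte per token per conversation, serving many concurrent long-context users quickly becomes a memory problem, which motivates the optimisations below.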
Optimising the KV cache
Several techniques reduce the memory footprint:
- Quantised KV cache: Storing cached values in lower precision (e.g., FP8 instead of FP16), roughly halving memory usage with minimal quality impact.
- Paged attention: Managing KV cache memory like virtual memory in operating systems, allocating and freeing blocks dynamically. This is the key innovation in the vLLM serving framework.
- Grouped query attention: Sharing key-value heads across multiple query heads, reducing the cache size proportionally.
- Sliding window attention: Only caching the most recent N tokens rather than the entire history, suitable for architectures like Mistral.
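The sliding-window idea from the list above can be sketched with a bounded buffer. The window of 4 is a toy value for demonstration (real models use thousands of positions, e.g. Mistral's 4,096); the integers stand in for key tensors:

```python
from collections import deque

# Sketch of a sliding-window KV cache: only the most recent WINDOW
# entries are kept, so memory stays bounded regardless of sequence
# length. WINDOW = 4 is a toy value for demonstration only.
WINDOW = 4
k_cache = deque(maxlen=WINDOW)

for token_id in range(10):
    k_cache.append(token_id)  # stand-in for the token's key tensor

print(list(k_cache))  # [6, 7, 8, 9]: the oldest entries were evicted
```

The other techniques compose with this: quantisation shrinks each entry, grouped query attention shrinks the number of key-value heads, and paged attention changes how the entries are laid out in GPU memory.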
Why this matters for AI deployment
The KV cache is one of the primary factors determining how many users an AI deployment can serve simultaneously. Efficient KV cache management directly translates to lower infrastructure costs and faster response times. When evaluating AI serving infrastructure, KV cache handling is one of the first things to assess.
Why This Matters
The KV cache is the reason AI models can generate text at conversational speed. Understanding this mechanism helps you appreciate the memory-speed trade-offs in AI deployment and make informed decisions about infrastructure sizing and cost optimisation.