KV Cache (Key-Value Cache)
A memory optimisation technique that stores previously computed attention keys and values during text generation, avoiding redundant computation and significantly speeding up AI model inference.
A KV cache (key-value cache) is a memory optimisation used during text generation by transformer-based AI models. It stores the intermediate computations from the attention mechanism for all previously processed tokens, so they do not need to be recalculated when generating each new token.
Why the KV cache is needed
When a transformer generates text, it produces one token at a time. For each new token, the attention mechanism must attend to all previous tokens in the sequence. Without caching, the model would redo the attention computations for the entire prefix at every step, a cost that grows quadratically with sequence length.
With a KV cache, the key and value tensors from previous steps are stored in memory. When generating the next token, the model only needs to compute new key-value pairs for the latest token and combine them with the cached values.
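The mechanism can be sketched in a few lines. This is a minimal single-head illustration with random projection weights, not any real model's implementation: the point is that only the new token's key and value are computed, while all earlier ones are read back from the cache.

```python
import numpy as np

# Minimal single-head attention with a KV cache. The head dimension and
# projection weights (Wq, Wk, Wv) are illustrative placeholders.
d = 4  # head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def attend(x):
    """Process one new token vector x, reusing cached keys/values."""
    q = x @ Wq
    k_cache.append(x @ Wk)   # only the NEW token's key is computed...
    v_cache.append(x @ Wv)   # ...and its value; the rest come from cache
    K = np.stack(k_cache)    # (t, d): keys for all tokens so far
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()         # softmax over the t cached positions
    return weights @ V               # attention output for the new token

for _ in range(3):
    out = attend(rng.standard_normal(d))

print(len(k_cache))  # 3 cached key vectors, one per token
```

Each call does O(t) work against the cache instead of recomputing keys and values for the whole prefix from scratch.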
The speed improvement
Consider generating a 1,000-token response. Without a KV cache, the model would perform:
- Step 1: process 1 token
- Step 2: process 2 tokens
- …
- Step 1,000: process 1,000 tokens
That is 500,500 token processings in total (the sum 1 + 2 + … + 1,000). With a KV cache, each step processes only the 1 new token plus a lookup into the cache, roughly 1,000 token processings in total, about a 500-fold reduction.
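The arithmetic above can be checked directly:

```python
n = 1_000  # tokens generated

# Without a cache: step t reprocesses all t tokens, so the total is 1 + 2 + ... + n.
no_cache = n * (n + 1) // 2
# With a cache: each step processes only the 1 new token.
with_cache = n

print(no_cache)    # 500500 total token processings
print(with_cache)  # 1000
```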
Memory trade-offs
The KV cache trades memory for speed. For a large model generating a long response, the cache can consume significant GPU memory, sometimes several gigabytes. This creates a practical limit on how many concurrent users a model can serve, because each active conversation requires its own cache.
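A back-of-the-envelope estimate shows where the gigabytes come from. The configuration below is illustrative (roughly a 7B-class model in FP16), not any specific model's published numbers:

```python
# Rough KV cache size estimate. Configuration is an assumption for
# illustration, roughly 7B-class, not a specific model's spec sheet.
layers, kv_heads, head_dim = 32, 32, 128
bytes_per_value = 2  # FP16
seq_len = 4096

# Factor of 2 covers both the key and the value stored per position.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total = per_token * seq_len

print(per_token)      # 524288 bytes, i.e. 512 KiB per token
print(total / 2**30)  # 2.0 GiB for a 4096-token context
```

At half a megabyte per token per conversation, serving many concurrent long-context users quickly becomes a memory problem, which motivates the optimisations below.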
Optimising the KV cache
Several techniques reduce the memory footprint:
- Quantised KV cache: Storing cached values in lower precision (e.g., FP8 instead of FP16), roughly halving memory usage with minimal quality impact.
- Paged attention: Managing KV cache memory like virtual memory in operating systems, allocating and freeing blocks dynamically. This is the key innovation in the vLLM serving framework.
- Grouped query attention: Sharing key-value heads across multiple query heads, reducing the cache size proportionally.
- Sliding window attention: Only caching the most recent N tokens rather than the entire history, suitable for architectures like Mistral.
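The sliding-window idea from the list above can be sketched with a bounded buffer. The window of 4 is a toy value for demonstration (real models use thousands of positions, e.g. Mistral's 4,096); the integers stand in for key tensors:

```python
from collections import deque

# Sketch of a sliding-window KV cache: only the most recent WINDOW
# entries are kept, so memory stays bounded regardless of sequence
# length. WINDOW = 4 is a toy value for demonstration only.
WINDOW = 4
k_cache = deque(maxlen=WINDOW)

for token_id in range(10):
    k_cache.append(token_id)  # stand-in for the token's key tensor

print(list(k_cache))  # [6, 7, 8, 9]: the oldest entries were evicted
```

The other techniques compose with this: quantisation shrinks each entry, grouped query attention shrinks the number of key-value heads, and paged attention changes how the entries are laid out in GPU memory.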
Why this matters for AI deployment
The KV cache is one of the primary factors determining how many users an AI deployment can serve simultaneously. Efficient KV cache management directly translates to lower infrastructure costs and faster response times. When evaluating AI serving infrastructure, KV cache handling is one of the first things to assess.
Why This Matters
The KV cache is the reason AI models can generate text at conversational speed. Understanding this mechanism helps you appreciate the memory-speed trade-offs in AI deployment and make informed decisions about infrastructure sizing and cost optimisation.