Speculative Decoding
An inference acceleration technique where a smaller, faster model drafts text that a larger model then verifies, significantly speeding up generation without sacrificing quality.
Speculative decoding is a technique for accelerating text generation from large language models. It works by using a smaller, faster model to draft candidate tokens, which a larger model then verifies in parallel, producing the same output as the large model alone but significantly faster.
The bottleneck it addresses
Large language models generate text one token at a time. Each token requires a full forward pass through the entire model. For very large models, each forward pass takes significant time, and generating a 500-token response means 500 sequential forward passes. This sequential nature is the primary bottleneck in LLM inference speed.
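The sequential bottleneck can be sketched in a few lines. The `forward` function below is a toy stand-in for a real model's forward pass, used only to show that every new token costs one full call:

```python
# Toy sketch of plain autoregressive decoding. `forward` is a
# hypothetical stand-in for a full forward pass through a model;
# here it just returns a deterministic "next token" id.

def forward(tokens):
    # In a real LLM this is the expensive step: one full pass
    # through every layer for each generated token.
    return (sum(tokens) + 1) % 50000

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):          # one forward pass per new token
        tokens.append(forward(tokens))
    return tokens

out = generate([1, 2, 3], 5)
# Generating a 500-token response this way means 500 sequential
# calls to forward(), which is the bottleneck described above.
```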
How speculative decoding works
- Draft phase: A smaller, faster model (the "draft model") quickly generates several candidate tokens, typically 3-8 at a time.
- Verification phase: The large model processes all draft tokens in a single forward pass, checking whether it would have generated the same tokens.
- Accept or reject: Draft tokens that match the target model's choice (or pass a rejection-sampling test against the target model's probabilities) are kept. The first rejected token is replaced with a token drawn from the target model instead.
- Repeat: The process continues from the last accepted token.
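The four steps above can be sketched with a minimal greedy-matching loop. `draft_next` and `target_next` are hypothetical stand-ins for the two models; production systems compare full probability distributions (rejection sampling) rather than exact token matches, and score all draft positions in one batched forward pass:

```python
# Minimal sketch of one draft/verify/accept iteration of speculative
# decoding, using greedy (exact-match) acceptance for clarity.

def speculative_step(tokens, draft_next, target_next, k=4):
    # Draft phase: the small model proposes k candidate tokens.
    draft = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # Verification phase: in a real system the target model scores all
    # k positions in a single forward pass; here we simulate that by
    # querying the target's choice at each prefix.
    accepted = []
    for t in draft:
        want = target_next(tokens + accepted)
        if want == t:
            accepted.append(t)      # draft token matches: keep it
        else:
            accepted.append(want)   # first mismatch: take target's token
            break                   # discard the remaining draft tokens
    else:
        # Every draft token was accepted; the target's verification pass
        # also yields one extra "bonus" token for free.
        accepted.append(target_next(tokens + accepted))

    return tokens + accepted
```

When draft and target fully agree, each iteration advances the sequence by k + 1 tokens for a single target forward pass; when they disagree at position i, it still advances by i + 1 tokens, never fewer than one.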
The key insight is that verification is much faster than generation. While generating N tokens requires N forward passes, verifying N tokens requires only one forward pass (because the model processes all tokens in parallel during verification, just as it would during prompt processing).
Why it works so well
For many tokens (especially common words, function words, and predictable continuations) the small model and the large model agree. Research shows agreement rates of 70-90% for well-chosen draft models. This means the large model effectively "generates" multiple tokens per forward pass, with the small model handling the easy tokens and the large model intervening only when it disagrees.
Performance gains
Speculative decoding typically provides a 2-3x speedup in token generation with no quality loss. The exact speedup depends on:
- How well the draft model approximates the target model
- The nature of the text being generated (predictable text benefits more)
- The size ratio between draft and target models
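A common back-of-envelope model ties these factors together: if each draft token is accepted independently with probability alpha and the draft length is k, the expected number of tokens produced per target forward pass is (1 - alpha^(k+1)) / (1 - alpha). The figures below are illustrative, not benchmarks:

```python
# Back-of-envelope estimate of tokens generated per target-model forward
# pass, assuming each draft token is accepted independently with
# probability `alpha` and drafts are `k` tokens long.

def expected_tokens_per_pass(alpha, k):
    # Expected accepted-run length plus the one token the target model
    # always contributes (a correction or a bonus token).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an 80% acceptance rate and 4 draft tokens per iteration,
# each target forward pass yields roughly 3.4 tokens on average.
e = expected_tokens_per_pass(0.8, 4)
```

Note that this counts only target-model passes; the draft model's own (smaller) cost eats into the gain, which is why observed end-to-end speedups land around 2-3x rather than at the theoretical ceiling.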
Requirements and trade-offs
- Draft model selection: The draft model must be fast enough to provide a genuine speedup and similar enough to the target model to achieve a high acceptance rate.
- Memory overhead: Both models must be in memory simultaneously, increasing total memory requirements.
- Implementation complexity: Speculative decoding adds complexity to the inference pipeline.
- Guaranteed quality: With the standard rejection-sampling acceptance rule, speculative decoding provably produces the same output distribution as sampling from the target model alone, so there is no quality trade-off to weigh against the speedup.
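The memory-overhead point is easy to quantify with a rough weights-only estimate. The model sizes below are hypothetical examples, not figures from the text, and real deployments also need memory for KV caches and activations:

```python
# Illustrative weights-only memory estimate for serving a draft model
# alongside a target model, assuming fp16 (2 bytes per parameter).
# The 70B/7B pairing is a hypothetical example.

BYTES_PER_PARAM = 2  # fp16

def weights_gb(n_params_billions):
    # Billions of parameters -> gigabytes of weight storage.
    return n_params_billions * BYTES_PER_PARAM

target_gb = weights_gb(70)        # 140 GB for the target model
draft_gb = weights_gb(7)          # 14 GB for the draft model
overhead = draft_gb / target_gb   # ~10% extra weight memory
```

With a draft model roughly a tenth the size of the target, the extra weight memory is modest, which is one reason small-draft pairings are the common choice.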
Industry adoption
Speculative decoding is increasingly used by major AI providers to reduce inference costs and latency. It is available in inference frameworks like vLLM and HuggingFace TGI, and is used internally by AI API providers to deliver faster responses without reducing model quality.
Why This Matters
Speculative decoding is one of the most impactful techniques for making AI faster and cheaper to deploy. Understanding it helps you appreciate why some AI providers can offer faster responses without sacrificing quality, and evaluate infrastructure claims from AI vendors.