Speculative Decoding
An inference acceleration technique where a smaller, faster model drafts text that a larger model then verifies, significantly speeding up generation without sacrificing quality.
Speculative decoding is a technique for accelerating text generation from large language models. It works by using a smaller, faster model to draft candidate tokens, which a larger model then verifies in parallel, producing the same output as the large model alone but significantly faster.
The bottleneck it addresses
Large language models generate text one token at a time. Each token requires a full forward pass through the entire model. For very large models, each forward pass takes significant time, and generating a 500-token response means 500 sequential forward passes. This sequential nature is the primary bottleneck in LLM inference speed.
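The sequential bottleneck can be sketched in a few lines. The `forward` function below is a toy stand-in for a real model's forward pass, used only to show that every new token costs one full call:

```python
# Toy sketch of plain autoregressive decoding. `forward` is a
# hypothetical stand-in for a full forward pass through a model;
# here it just returns a deterministic "next token" id.

def forward(tokens):
    # In a real LLM this is the expensive step: one full pass
    # through every layer for each generated token.
    return (sum(tokens) + 1) % 50000

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):          # one forward pass per new token
        tokens.append(forward(tokens))
    return tokens

out = generate([1, 2, 3], 5)
# Generating a 500-token response this way means 500 sequential
# calls to forward(), which is the bottleneck described above.
```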
How speculative decoding works
- Draft phase: A smaller, faster model (the "draft model") quickly generates several candidate tokens, typically 3-8 at a time.
- Verification phase: The large model processes all draft tokens in a single forward pass, checking whether it would have generated the same tokens.
- Accept or reject: Draft tokens that match the target model's choice (or pass a rejection-sampling test against the target model's probabilities) are kept. The first rejected token is replaced with a token drawn from the target model instead.
- Repeat: The process continues from the last accepted token.
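The four steps above can be sketched with a minimal greedy-matching loop. `draft_next` and `target_next` are hypothetical stand-ins for the two models; production systems compare full probability distributions (rejection sampling) rather than exact token matches, and score all draft positions in one batched forward pass:

```python
# Minimal sketch of one draft/verify/accept iteration of speculative
# decoding, using greedy (exact-match) acceptance for clarity.

def speculative_step(tokens, draft_next, target_next, k=4):
    # Draft phase: the small model proposes k candidate tokens.
    draft = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # Verification phase: in a real system the target model scores all
    # k positions in a single forward pass; here we simulate that by
    # querying the target's choice at each prefix.
    accepted = []
    for t in draft:
        want = target_next(tokens + accepted)
        if want == t:
            accepted.append(t)      # draft token matches: keep it
        else:
            accepted.append(want)   # first mismatch: take target's token
            break                   # discard the remaining draft tokens
    else:
        # Every draft token was accepted; the target's verification pass
        # also yields one extra "bonus" token for free.
        accepted.append(target_next(tokens + accepted))

    return tokens + accepted
```

When draft and target fully agree, each iteration advances the sequence by k + 1 tokens for a single target forward pass; when they disagree at position i, it still advances by i + 1 tokens, never fewer than one.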
The key insight is that verification is much faster than generation. While generating N tokens requires N forward passes, verifying N tokens requires only one forward pass (because the model processes all tokens in parallel during verification, just as it would during prompt processing).
Why it works so well
For many tokens (especially common words, function words, and predictable continuations) the small model and the large model agree. Research shows agreement rates of 70-90% for well-chosen draft models. This means the large model effectively "generates" multiple tokens per forward pass, with the small model handling the easy tokens and the large model intervening only when it disagrees.
Performance gains
Speculative decoding typically provides a 2-3x speedup in token generation with no quality loss. The exact speedup depends on:
- How well the draft model approximates the target model
- The nature of the text being generated (predictable text benefits more)
- The size ratio between draft and target models
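A common back-of-envelope model ties these factors together: if each draft token is accepted independently with probability alpha and the draft length is k, the expected number of tokens produced per target forward pass is (1 - alpha^(k+1)) / (1 - alpha). The figures below are illustrative, not benchmarks:

```python
# Back-of-envelope estimate of tokens generated per target-model forward
# pass, assuming each draft token is accepted independently with
# probability `alpha` and drafts are `k` tokens long.

def expected_tokens_per_pass(alpha, k):
    # Expected accepted-run length plus the one token the target model
    # always contributes (a correction or a bonus token).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an 80% acceptance rate and 4 draft tokens per iteration,
# each target forward pass yields roughly 3.4 tokens on average.
e = expected_tokens_per_pass(0.8, 4)
```

Note that this counts only target-model passes; the draft model's own (smaller) cost eats into the gain, which is why observed end-to-end speedups land around 2-3x rather than at the theoretical ceiling.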
Requirements and trade-offs
- Draft model selection: The draft model must be fast enough to provide a genuine speedup and similar enough to the target model to achieve a high acceptance rate.
- Memory overhead: Both models must be in memory simultaneously, increasing total memory requirements.
- Implementation complexity: Speculative decoding adds complexity to the inference pipeline.
- Guaranteed quality: With the standard rejection-sampling acceptance rule, speculative decoding provably produces the same output distribution as sampling from the target model alone, so there is no quality trade-off to weigh against the speedup.
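The memory-overhead point is easy to quantify with a rough weights-only estimate. The model sizes below are hypothetical examples, not figures from the text, and real deployments also need memory for KV caches and activations:

```python
# Illustrative weights-only memory estimate for serving a draft model
# alongside a target model, assuming fp16 (2 bytes per parameter).
# The 70B/7B pairing is a hypothetical example.

BYTES_PER_PARAM = 2  # fp16

def weights_gb(n_params_billions):
    # Billions of parameters -> gigabytes of weight storage.
    return n_params_billions * BYTES_PER_PARAM

target_gb = weights_gb(70)        # 140 GB for the target model
draft_gb = weights_gb(7)          # 14 GB for the draft model
overhead = draft_gb / target_gb   # ~10% extra weight memory
```

With a draft model roughly a tenth the size of the target, the extra weight memory is modest, which is one reason small-draft pairings are the common choice.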
Industry adoption
Speculative decoding is increasingly used by major AI providers to reduce inference costs and latency. It is available in inference frameworks like vLLM and HuggingFace TGI, and is used internally by AI API providers to deliver faster responses without reducing model quality.
Why This Matters
Speculative decoding is one of the most impactful techniques for making AI faster and cheaper to deploy. Understanding it helps you appreciate why some AI providers can offer faster responses without sacrificing quality, and evaluate infrastructure claims from AI vendors.