Knowledge Distillation
A technique for training a smaller, faster AI model to replicate the behaviour of a larger, more capable model.
Knowledge distillation is a machine learning technique where a large, well-trained model (the teacher) is used to train a smaller model (the student) that approximates the teacher's performance while being much cheaper and faster to run.
Why not just use the big model?
Large models deliver impressive results, but they are expensive. Running a 70-billion-parameter model for every customer query costs significant compute resources and adds latency. If a 7-billion-parameter model can achieve 95% of the same quality, the savings in cost and speed are enormous, especially at scale.
How distillation works
In standard training, a model learns from labelled data: inputs paired with correct answers. In distillation, the student model learns from the teacher's outputs instead. The teacher's outputs contain richer information than simple labels. When a teacher model classifies an image, it does not just say "cat"; it outputs a probability distribution across all categories. These "soft targets" carry information about relationships between categories that hard labels miss.
The student model is trained to match these soft probability distributions, effectively learning how the teacher weighs the alternatives rather than just memorising correct answers.
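The idea above can be sketched numerically. The following is a minimal, hypothetical illustration (not a specific library's API) of temperature-scaled soft targets and the cross-entropy loss a student would minimise against them; the logits and class names are invented for the example.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # A temperature above 1 softens the distribution, exposing the
    # teacher's relative confidence across the wrong classes too.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the teacher's soft targets and the
    # student's temperature-scaled predictions.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

# Invented logits over three classes: cat, dog, car. The teacher is
# fairly sure this is a cat, but "dog" gets noticeable mass, and that
# relationship is part of what the student learns.
teacher = np.array([[4.0, 2.5, -1.0]])
student = np.array([[3.0, 1.0, 0.0]])
loss = distillation_loss(student, teacher)
```

Minimising this loss pushes the student's whole distribution toward the teacher's, not just its top prediction.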
Common distillation approaches
- Response distillation: The student learns to match the teacher's final output on a large dataset of prompts.
- Feature distillation: The student learns to match the teacher's internal representations at various layers.
- Task-specific distillation: The teacher generates training data for a narrow task, and the student specialises on that task alone.
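Response distillation, the first approach above, amounts to a simple data pipeline: the teacher labels a prompt set, and the resulting pairs become ordinary supervised training data for the student. A hedged sketch, where `teacher_generate` is a hypothetical stand-in for a call to the large model:

```python
def teacher_generate(prompt):
    # Stand-in for querying the large teacher model; a real system
    # would call a model API here.
    return f"teacher answer to: {prompt}"

def build_distillation_set(prompts):
    # Pair each prompt with the teacher's response; these pairs are
    # then used to train the student exactly like labelled examples.
    return [(p, teacher_generate(p)) for p in prompts]

prompts = ["Summarise this email", "Classify this support ticket"]
dataset = build_distillation_set(prompts)
```

Feature distillation differs only in what is matched: instead of final outputs, the student is trained to reproduce the teacher's intermediate-layer activations.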
Real-world applications
- Running AI on mobile devices and edge hardware where large models cannot fit.
- Reducing API costs by using a distilled model for routine queries and reserving the large model for complex ones.
- Creating fast, specialised models for specific business tasks like classification or extraction.
Distillation vs fine-tuning
Fine-tuning adapts an existing model to new data. Distillation creates a new, smaller model that mimics a larger one. They can be combined: you can distil a large model into a small one and then fine-tune the small model on your domain-specific data.
Why This Matters
Knowledge distillation directly affects the cost and speed of AI deployment. Understanding it helps you evaluate whether you need the most expensive model or whether a distilled alternative delivers sufficient quality at a fraction of the price. Many production AI systems use distilled models to keep costs manageable.