Distillation
A technique for training a smaller, faster AI model to mimic the behaviour of a larger, more capable model, preserving most of the performance at a fraction of the cost.
Distillation (or knowledge distillation) is a technique for compressing the knowledge of a large, expensive AI model into a smaller, cheaper one. The large model (called the teacher) supplies the training signal for the small model (called the student), which learns to produce similar outputs.
How distillation works
Instead of training the student model on raw data, you train it on the teacher model's outputs. This is surprisingly effective because the teacher's outputs contain richer information than simple labels. When a teacher model says an image is "ninety per cent likely a cat and eight per cent likely a dog," it is teaching the student about the similarity between cats and dogs – information that a hard label ("cat") does not convey.
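The idea above can be sketched in plain Python: soften the teacher's logits with a temperature, then penalise the student for diverging from that softened distribution. This is a minimal illustration, not a production recipe; the class logits, temperature value, and variable names are all invented for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened probabilities and the student's.

    Minimising this pushes the student's distribution toward the teacher's,
    including the small probabilities that encode class similarity
    (the cat/dog relationship described above).
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Toy logits for the classes [cat, dog, car]: the teacher is confident
# in "cat" but still assigns noticeable mass to "dog".
teacher = [4.0, 1.5, -2.0]
good_student = [3.8, 1.4, -1.9]   # close to the teacher
bad_student = [1.0, 1.0, 1.0]     # uniform, has learned nothing

assert distillation_loss(good_student, teacher) < distillation_loss(bad_student, teacher)
```

In real training the distillation loss is usually mixed with an ordinary cross-entropy loss on the ground-truth labels, so the student learns from both the teacher and the data.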
Why distillation matters
- Cost reduction – a distilled model might be ten times smaller and faster while retaining ninety per cent of the teacher's capability
- Edge deployment – smaller models can run on phones, embedded devices, and environments with limited compute
- Latency – smaller models respond faster, critical for real-time applications
- Budget – API costs are proportional to model size; smaller models cost less per query
Types of distillation
- Output distillation – the student learns to match the teacher's output probabilities
- Feature distillation – the student learns to match the teacher's internal representations at intermediate layers
- Self-distillation – a model distils knowledge from a larger version of itself
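Feature distillation, the second variant above, can be sketched as a mean-squared error between the teacher's hidden features and the student's. Because the student's hidden layer is typically narrower than the teacher's, a learned projection maps the student's features into the teacher's feature space first. Everything here is a toy illustration with made-up dimensions and values.

```python
def feature_distillation_loss(student_features, teacher_features, projection):
    """Mean squared error between teacher features and projected student features.

    `projection` is a (teacher_dim x student_dim) matrix, standing in for the
    learned linear map used when the two feature sizes differ.
    """
    # Project the student's features into the teacher's dimensionality.
    projected = [sum(w * s for w, s in zip(row, student_features)) for row in projection]
    return sum((t - p) ** 2 for t, p in zip(teacher_features, projected)) / len(teacher_features)

# Toy example: a 2-dim student feature vector against a 3-dim teacher one.
student_h = [0.5, -1.0]
teacher_h = [0.4, -0.9, 0.1]
proj = [[1.0, 0.0],
        [0.0, 1.0],
        [0.1, 0.1]]  # 3x2 projection matrix
loss = feature_distillation_loss(student_h, teacher_h, proj)
```

In practice this feature-matching term is added to the output-distillation loss with a weighting factor, rather than used on its own.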
Distillation in practice
Many of the efficient AI models you use daily are distilled versions of larger models. Companies often fine-tune and distil large foundation models into specialised models optimised for specific tasks. This is how you get capable AI running on a smartphone.
Limitations
- Distilled models inevitably lose some capability, particularly on edge cases and complex reasoning
- The quality ceiling is the teacher model – a student trained only on the teacher's outputs cannot learn what the teacher does not know
- For highly specialised tasks, training directly on task-specific data may outperform distillation
Why this matters
Distillation is the practical technique behind affordable AI deployment. Understanding it helps you make informed decisions about model selection: sometimes a smaller, distilled model delivers ninety per cent of the value at ten per cent of the cost, which is the right trade-off for many business applications.