Model Distillation
A technique where a smaller 'student' model is trained to replicate the behaviour of a larger 'teacher' model, producing a compact model that retains most of the original's capability.
Model distillation is a technique for creating smaller, faster, cheaper AI models by training a compact "student" model to mimic the behaviour of a larger, more capable "teacher" model. The student learns not just the correct answers but the teacher's full probability distribution, capturing nuances and confidence levels that simple training data cannot convey.
How distillation works
Traditional training teaches a model from labelled data: "this image is a cat." Distillation provides richer information: "this image is 92% likely a cat, 5% likely a dog, 2% likely a fox, and 1% likely a rabbit." These "soft labels" from the teacher model contain far more information than hard labels, helping the student learn faster and better.
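The soft labels above come from a temperature-scaled softmax over the teacher's raw outputs (logits). A minimal sketch in plain Python, with made-up logit values for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; a higher temperature
    spreads probability mass across more classes."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes [cat, dog, fox, rabbit]
teacher_logits = [5.0, 2.1, 1.2, 0.5]

hard_label = [1, 0, 0, 0]                           # all that standard training sees
soft_t1 = softmax(teacher_logits)                   # teacher's confident distribution
soft_t4 = softmax(teacher_logits, temperature=4.0)  # softened version used for distillation

print([round(p, 3) for p in soft_t1])
print([round(p, 3) for p in soft_t4])
```

Note how the softened distribution keeps the same ranking of classes while revealing far more about which alternatives the teacher considered plausible.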
The process:
- Teacher training: Train a large, capable model on the full dataset (or use an existing large model).
- Soft label generation: Run the training data through the teacher to generate probability distributions for each example.
- Student training: Train the smaller model on a combination of the soft labels (from the teacher) and the hard labels (from the original data).
- Temperature scaling: The teacher's outputs are often "softened" by raising the softmax temperature (dividing the logits by a value greater than 1), which spreads probability mass across more classes and makes the distributions more informative.
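The steps above can be sketched as a single training objective: a weighted sum of a soft-label term (matching the teacher's softened distribution) and a hard-label term (matching the ground truth). The function names, weighting, and example logits below are illustrative assumptions, not any particular library's API:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """H(p, q) = -sum p_i * log q_i; works for both hard and soft targets."""
    return -sum(p * math.log(q + 1e-12)
                for p, q in zip(target_probs, predicted_probs))

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.7):
    """alpha weights the soft (teacher) term, (1 - alpha) the hard
    (ground-truth) term. The T^2 factor keeps gradient magnitudes
    comparable as the temperature changes."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = cross_entropy(soft_teacher, soft_student) * temperature ** 2
    hard_term = cross_entropy(hard_label, softmax(student_logits))
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical logits for the classes [cat, dog, fox, rabbit]
teacher = [5.0, 2.1, 1.2, 0.5]
good_student = [4.8, 2.0, 1.1, 0.4]   # closely mimics the teacher
poor_student = [1.0, 1.0, 1.0, 1.0]   # uninformative predictions
cat = [1, 0, 0, 0]

print(distillation_loss(good_student, teacher, cat))
print(distillation_loss(poor_student, teacher, cat))
```

In a real training loop this loss would be computed per batch and minimised by gradient descent over the student's parameters; the student that mimics the teacher closely scores a lower loss than the uninformative one.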
Why distillation produces better small models
A small model trained directly on labelled data receives limited information: just the correct class for each example. A distilled model receives the teacher's complete understanding of each example, including:
- How confident the teacher was
- Which alternative answers the teacher considered plausible
- Subtle relationships between classes that hard labels cannot express
This "dark knowledge" (the information contained in the non-winning classes) is what makes distillation so effective.
Real-world examples
- DistilBERT: A distilled version of BERT that retains 97% of BERT's language-understanding performance while being 40% smaller and 60% faster.
- DistilGPT-2: A distilled version of GPT-2, produced with the same knowledge-distillation method as DistilBERT.
- Distil-Whisper: A distilled version of OpenAI's Whisper speech recognition model, substantially smaller and faster than the original.
- GPT-4 to GPT-4o-mini: While the exact method is not public, OpenAI's smaller models benefit from knowledge transfer from their larger models.
Distillation versus quantisation
Both techniques produce smaller models, but they work differently:
- Quantisation: Reduces the precision of the existing model's numbers. Same architecture, fewer bits per weight.
- Distillation: Trains a genuinely smaller architecture to mimic the larger one. Fewer parameters, potentially different architecture.
They can be combined: distil a large model into a smaller one, then quantise the smaller model for even greater efficiency.
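To make the contrast concrete, here is a toy symmetric int8 weight quantiser in plain Python; the weights and the single-scale scheme are illustrative assumptions, not a production method:

```python
def quantise_int8(weights):
    """Symmetric linear quantisation: map floats to integers in
    [-127, 127] plus one float scale factor, so each weight needs
    8 bits instead of 32."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

# Hypothetical weights from an already-distilled student model
weights = [0.82, -0.44, 0.05, -1.27, 0.33]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q)          # small integers, one per weight
print(max_err)    # rounding error, bounded by half a quantisation step
```

Note the division of labour: distillation reduces the *number* of weights the student has, while quantisation reduces the *bits per weight*, which is why the two savings multiply when combined.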
Legal and ethical considerations
Some AI providers prohibit using their models' outputs to train competing models, a restriction aimed squarely at distillation. OpenAI's terms of service, for instance, restrict this use case. Understanding these restrictions is important for organisations considering distillation as part of their AI strategy.
Why This Matters
Model distillation is how the AI industry creates the smaller, cheaper models that make AI economically viable for everyday business use. Understanding distillation helps you evaluate the trade-offs between model size, quality, and cost when choosing AI solutions.