Model Distillation
A technique where a smaller 'student' model is trained to replicate the behaviour of a larger 'teacher' model, producing a compact model that retains most of the original's capability.
Model distillation is a technique for creating smaller, faster, cheaper AI models by training a compact "student" model to mimic the behaviour of a larger, more capable "teacher" model. The student learns not just the correct answers but the teacher's full probability distribution, capturing nuances and confidence levels that simple training data cannot convey.
How distillation works
Traditional training teaches a model from labelled data: "this image is a cat." Distillation provides richer information: "this image is 92% likely a cat, 5% likely a dog, 2% likely a fox, and 1% likely a rabbit." These "soft labels" from the teacher model contain far more information than hard labels, helping the student learn faster and better.
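The soft labels above come from a temperature-scaled softmax over the teacher's raw outputs (logits). A minimal sketch in plain Python, with made-up logit values for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; a higher temperature
    spreads probability mass across more classes."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes [cat, dog, fox, rabbit]
teacher_logits = [5.0, 2.1, 1.2, 0.5]

hard_label = [1, 0, 0, 0]                           # all that standard training sees
soft_t1 = softmax(teacher_logits)                   # teacher's confident distribution
soft_t4 = softmax(teacher_logits, temperature=4.0)  # softened version used for distillation

print([round(p, 3) for p in soft_t1])
print([round(p, 3) for p in soft_t4])
```

Note how the softened distribution keeps the same ranking of classes while revealing far more about which alternatives the teacher considered plausible.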
The process:
- Teacher training: Train a large, capable model on the full dataset (or use an existing large model).
- Soft label generation: Run the training data through the teacher to generate probability distributions for each example.
- Student training: Train the smaller model on a combination of the soft labels (from the teacher) and the hard labels (from the original data).
- Temperature scaling: The teacher's outputs are often "softened" by raising the softmax temperature (dividing the logits by a value greater than 1), which spreads probability mass across more classes and makes the distributions more informative.
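The steps above can be sketched as a single training objective: a weighted sum of a soft-label term (matching the teacher's softened distribution) and a hard-label term (matching the ground truth). The function names, weighting, and example logits below are illustrative assumptions, not any particular library's API:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """H(p, q) = -sum p_i * log q_i; works for both hard and soft targets."""
    return -sum(p * math.log(q + 1e-12)
                for p, q in zip(target_probs, predicted_probs))

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.7):
    """alpha weights the soft (teacher) term, (1 - alpha) the hard
    (ground-truth) term. The T^2 factor keeps gradient magnitudes
    comparable as the temperature changes."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = cross_entropy(soft_teacher, soft_student) * temperature ** 2
    hard_term = cross_entropy(hard_label, softmax(student_logits))
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical logits for the classes [cat, dog, fox, rabbit]
teacher = [5.0, 2.1, 1.2, 0.5]
good_student = [4.8, 2.0, 1.1, 0.4]   # closely mimics the teacher
poor_student = [1.0, 1.0, 1.0, 1.0]   # uninformative predictions
cat = [1, 0, 0, 0]

print(distillation_loss(good_student, teacher, cat))
print(distillation_loss(poor_student, teacher, cat))
```

In a real training loop this loss would be computed per batch and minimised by gradient descent over the student's parameters; the student that mimics the teacher closely scores a lower loss than the uninformative one.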
Why distillation produces better small models
A small model trained directly on labelled data receives limited information: just the correct class for each example. A distilled model receives the teacher's complete understanding of each example, including:
- How confident the teacher was
- Which alternative answers the teacher considered plausible
- Subtle relationships between classes that hard labels cannot express
This "dark knowledge" (the information contained in the non-winning classes) is what makes distillation so effective.
Real-world examples
- DistilBERT: A distilled version of BERT that retains 97% of BERT's language-understanding performance while being 40% smaller and 60% faster.
- DistilGPT-2: A distilled version of GPT-2, produced with the same knowledge-distillation method as DistilBERT.
- Distil-Whisper: A distilled version of OpenAI's Whisper speech recognition model, substantially smaller and faster than the original.
- GPT-4 to GPT-4o-mini: While the exact method is not public, OpenAI's smaller models benefit from knowledge transfer from their larger models.
Distillation versus quantisation
Both techniques produce smaller models, but they work differently:
- Quantisation: Reduces the precision of the existing model's numbers. Same architecture, fewer bits per weight.
- Distillation: Trains a genuinely smaller architecture to mimic the larger one. Fewer parameters, potentially different architecture.
They can be combined: distil a large model into a smaller one, then quantise the smaller model for even greater efficiency.
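To make the contrast concrete, here is a toy symmetric int8 weight quantiser in plain Python; the weights and the single-scale scheme are illustrative assumptions, not a production method:

```python
def quantise_int8(weights):
    """Symmetric linear quantisation: map floats to integers in
    [-127, 127] plus one float scale factor, so each weight needs
    8 bits instead of 32."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

# Hypothetical weights from an already-distilled student model
weights = [0.82, -0.44, 0.05, -1.27, 0.33]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q)          # small integers, one per weight
print(max_err)    # rounding error, bounded by half a quantisation step
```

Note the division of labour: distillation reduces the *number* of weights the student has, while quantisation reduces the *bits per weight*, which is why the two savings multiply when combined.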
Legal and ethical considerations
Some AI providers prohibit using their models' outputs to train competing models, a restriction aimed squarely at distillation. OpenAI's terms of service, for instance, restrict this use case. Understanding these restrictions is important for organisations considering distillation as part of their AI strategy.
Why This Matters
Model distillation is how the AI industry creates the smaller, cheaper models that make AI economically viable for everyday business use. Understanding distillation helps you evaluate the trade-offs between model size, quality, and cost when choosing AI solutions.