Distillation
A technique for training a smaller, faster AI model to mimic the behaviour of a larger, more capable model, preserving most of the performance at a fraction of the cost.
Distillation (or knowledge distillation) is a technique for compressing the knowledge of a large, expensive AI model into a smaller, cheaper one. The large model (called the teacher) supplies the training signal for the small model (called the student), which learns to produce similar outputs.
How distillation works
Instead of training the student model on raw data, you train it on the teacher model's outputs. This is surprisingly effective because the teacher's outputs contain richer information than simple labels. When a teacher model says an image is "ninety per cent likely a cat and eight per cent likely a dog," it is teaching the student about the similarity between cats and dogs – information that a hard label ("cat") does not convey.
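The idea above can be sketched in plain Python: soften the teacher's logits with a temperature, then penalise the student for diverging from that softened distribution. This is a minimal illustration, not a production recipe; the class logits, temperature value, and variable names are all invented for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened probabilities and the student's.

    Minimising this pushes the student's distribution toward the teacher's,
    including the small probabilities that encode class similarity
    (the cat/dog relationship described above).
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Toy logits for the classes [cat, dog, car]: the teacher is confident
# in "cat" but still assigns noticeable mass to "dog".
teacher = [4.0, 1.5, -2.0]
good_student = [3.8, 1.4, -1.9]   # close to the teacher
bad_student = [1.0, 1.0, 1.0]     # uniform, has learned nothing

assert distillation_loss(good_student, teacher) < distillation_loss(bad_student, teacher)
```

In real training the distillation loss is usually mixed with an ordinary cross-entropy loss on the ground-truth labels, so the student learns from both the teacher and the data.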
Why distillation matters
- Cost reduction – a distilled model might be ten times smaller and faster while retaining ninety per cent of the teacher's capability
- Edge deployment – smaller models can run on phones, embedded devices, and environments with limited compute
- Latency – smaller models respond faster, critical for real-time applications
- Budget – API costs are proportional to model size; smaller models cost less per query
Types of distillation
- Output distillation – the student learns to match the teacher's output probabilities
- Feature distillation – the student learns to match the teacher's internal representations at intermediate layers
- Self-distillation – a model distils knowledge from a larger version of itself
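Feature distillation, the second variant above, can be sketched as a mean-squared error between the teacher's hidden features and the student's. Because the student's hidden layer is typically narrower than the teacher's, a learned projection maps the student's features into the teacher's feature space first. Everything here is a toy illustration with made-up dimensions and values.

```python
def feature_distillation_loss(student_features, teacher_features, projection):
    """Mean squared error between teacher features and projected student features.

    `projection` is a (teacher_dim x student_dim) matrix, standing in for the
    learned linear map used when the two feature sizes differ.
    """
    # Project the student's features into the teacher's dimensionality.
    projected = [sum(w * s for w, s in zip(row, student_features)) for row in projection]
    return sum((t - p) ** 2 for t, p in zip(teacher_features, projected)) / len(teacher_features)

# Toy example: a 2-dim student feature vector against a 3-dim teacher one.
student_h = [0.5, -1.0]
teacher_h = [0.4, -0.9, 0.1]
proj = [[1.0, 0.0],
        [0.0, 1.0],
        [0.1, 0.1]]  # 3x2 projection matrix
loss = feature_distillation_loss(student_h, teacher_h, proj)
```

In practice this feature-matching term is added to the output-distillation loss with a weighting factor, rather than used on its own.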
Distillation in practice
Many of the efficient AI models you use daily are distilled versions of larger models. Companies often fine-tune and distil large foundation models into specialised models optimised for specific tasks. This is how you get capable AI running on a smartphone.
Limitations
- Distilled models inevitably lose some capability, particularly on edge cases and complex reasoning
- The quality ceiling is the teacher model – a student trained only on the teacher's outputs cannot learn what the teacher does not know
- For highly specialised tasks, training directly on task-specific data may outperform distillation
Why this matters
Distillation is the practical technique behind affordable AI deployment. Understanding it helps you make informed decisions about model selection: sometimes a smaller, distilled model delivers ninety per cent of the value at ten per cent of the cost, which is the right trade-off for many business applications.