Knowledge Distillation
A technique for training a smaller, faster AI model to replicate the behaviour of a larger, more capable model.
Knowledge distillation is a machine learning technique where a large, well-trained model (the teacher) is used to train a smaller model (the student) that approximates the teacher's performance while being much cheaper and faster to run.
Why not just use the big model?
Large models deliver impressive results, but they are expensive. Running a 70-billion-parameter model for every customer query costs significant compute resources and adds latency. If a 7-billion-parameter model can achieve 95% of the same quality, the savings in cost and speed are enormous, especially at scale.
How distillation works
In standard training, a model learns from labelled data: inputs paired with correct answers. In distillation, the student model learns from the teacher's outputs instead. The teacher's outputs contain richer information than simple labels. When a teacher model classifies an image, it does not just say "cat"; it outputs a probability distribution across all categories. These "soft targets" carry information about relationships between categories that hard labels miss.
The student model is trained to match these soft probability distributions, effectively learning how the teacher weighs the alternatives rather than just memorising correct answers.
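The idea above can be sketched numerically. The following is a minimal, hypothetical illustration (not a specific library's API) of temperature-scaled soft targets and the cross-entropy loss a student would minimise against them; the logits and class names are invented for the example.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # A temperature above 1 softens the distribution, exposing the
    # teacher's relative confidence across the wrong classes too.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the teacher's soft targets and the
    # student's temperature-scaled predictions.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

# Invented logits over three classes: cat, dog, car. The teacher is
# fairly sure this is a cat, but "dog" gets noticeable mass, and that
# relationship is part of what the student learns.
teacher = np.array([[4.0, 2.5, -1.0]])
student = np.array([[3.0, 1.0, 0.0]])
loss = distillation_loss(student, teacher)
```

Minimising this loss pushes the student's whole distribution toward the teacher's, not just its top prediction.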
Common distillation approaches
- Response distillation: The student learns to match the teacher's final output on a large dataset of prompts.
- Feature distillation: The student learns to match the teacher's internal representations at various layers.
- Task-specific distillation: The teacher generates training data for a narrow task, and the student specialises on that task alone.
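Response distillation, the first approach above, amounts to a simple data pipeline: the teacher labels a prompt set, and the resulting pairs become ordinary supervised training data for the student. A hedged sketch, where `teacher_generate` is a hypothetical stand-in for a call to the large model:

```python
def teacher_generate(prompt):
    # Stand-in for querying the large teacher model; a real system
    # would call a model API here.
    return f"teacher answer to: {prompt}"

def build_distillation_set(prompts):
    # Pair each prompt with the teacher's response; these pairs are
    # then used to train the student exactly like labelled examples.
    return [(p, teacher_generate(p)) for p in prompts]

prompts = ["Summarise this email", "Classify this support ticket"]
dataset = build_distillation_set(prompts)
```

Feature distillation differs only in what is matched: instead of final outputs, the student is trained to reproduce the teacher's intermediate-layer activations.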
Real-world applications
- Running AI on mobile devices and edge hardware where large models cannot fit.
- Reducing API costs by using a distilled model for routine queries and reserving the large model for complex ones.
- Creating fast, specialised models for specific business tasks like classification or extraction.
Distillation vs fine-tuning
Fine-tuning adapts an existing model to new data. Distillation creates a new, smaller model that mimics a larger one. They can be combined: you can distil a large model into a small one and then fine-tune the small model on your domain-specific data.
Why This Matters
Knowledge distillation directly affects the cost and speed of AI deployment. Understanding it helps you evaluate whether you need the most expensive model or whether a distilled alternative delivers sufficient quality at a fraction of the price. Many production AI systems use distilled models to keep costs manageable.