
Model Compression

Last reviewed: April 2026

A set of techniques for reducing the size and computational requirements of AI models while preserving as much of their capability as possible.

Model compression refers to techniques that make AI models smaller, faster, and cheaper to run while maintaining acceptable performance. It is essential for deploying capable models in resource-constrained environments and for reducing the cost of AI at scale.

Why compress models?

Large language models can have hundreds of billions of parameters, requiring expensive GPU clusters to run. Model compression enables running AI on cheaper hardware, reducing cloud computing costs, deploying models on edge devices (phones, laptops, IoT), lowering latency for time-sensitive applications, and reducing energy consumption.

Key compression techniques

  • Quantization: Reducing the precision of model weights from 16-bit or 32-bit floating point to 8-bit, 4-bit, or even 2-bit integers. This can shrink model size by 4-8x with modest quality loss.
  • Pruning: Removing weights, neurons, or entire layers that contribute least to the model's output. Structured pruning removes entire channels or attention heads; unstructured pruning zeros out individual weights.
  • Knowledge distillation: Training a smaller "student" model to mimic the behaviour of the larger "teacher" model. The student learns the teacher's output patterns rather than training from raw data.
  • Low-rank factorization: Decomposing large weight matrices into products of smaller matrices, reducing parameter count while approximating the original computation.
  • Weight sharing: Having multiple parts of the model use the same set of weights, reducing total parameters.
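The first technique above, quantization, can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor int8 quantization using NumPy; production quantizers typically work per-channel and calibrate on real activations, so treat the function names and the single global scale as simplifying assumptions.

```python
# Minimal sketch of symmetric per-tensor int8 quantization (illustrative only).
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 using a single scale factor."""
    scale = np.abs(w).max() / 127.0      # largest magnitude maps to +/-127
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32 (1 byte vs 4 bytes per weight),
# and round-to-nearest bounds the per-weight error by half a quantization step.
print(w.nbytes // q.nbytes)   # 4
```

Going from 8-bit to 4-bit halves the storage again, which is where the 4-8x figure above comes from; the trade is a coarser grid and therefore larger rounding error per weight.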

Compression in practice

Most consumer-facing AI applications use compressed models. When you run an AI model on your phone or laptop, it has almost certainly been quantized and possibly pruned. Open-source communities routinely create 4-bit quantized versions of large models that run on consumer GPUs, making capable AI accessible without expensive hardware.

Trade-offs

Compression always involves trade-offs. Aggressive compression can degrade performance on complex reasoning tasks while maintaining quality on simpler ones. The right level of compression depends on your use case: a customer service chatbot may tolerate more compression than a medical diagnosis system.


Why This Matters

Model compression is what makes AI accessible and affordable. Understanding compression techniques helps you evaluate whether you need the largest model or whether a compressed version delivers sufficient quality at a fraction of the cost and infrastructure requirements.

Learn More

This topic is covered in our lesson: The Economics of AI