
Model Compression

Last reviewed: April 2026

A set of techniques for reducing the size and computational requirements of AI models while preserving as much of their capability as possible.

Model compression refers to techniques that make AI models smaller, faster, and cheaper to run while maintaining acceptable performance. It is essential for deploying capable models in resource-constrained environments and for reducing the cost of AI at scale.

Why compress models?

Large language models can have hundreds of billions of parameters, requiring expensive GPU clusters to run. Model compression enables running AI on cheaper hardware, reducing cloud computing costs, deploying models on edge devices (phones, laptops, IoT), lowering latency for time-sensitive applications, and reducing energy consumption.

Key compression techniques

  • Quantization: Reducing the precision of model weights from 16-bit or 32-bit floating point to 8-bit, 4-bit, or even 2-bit integers. This can shrink model size by 4-8x with modest quality loss.
  • Pruning: Removing weights, neurons, or entire layers that contribute least to the model's output. Structured pruning removes entire channels or attention heads; unstructured pruning zeros out individual weights.
  • Knowledge distillation: Training a smaller "student" model to mimic the behaviour of the larger "teacher" model. The student learns the teacher's output patterns rather than training from raw data.
  • Low-rank factorization: Decomposing large weight matrices into products of smaller matrices, reducing parameter count while approximating the original computation.
  • Weight sharing: Having multiple parts of the model use the same set of weights, reducing total parameters.
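The first technique above, quantization, can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor int8 quantization using NumPy; production quantizers typically work per-channel and calibrate on real activations, so treat the function names and the single global scale as simplifying assumptions.

```python
# Minimal sketch of symmetric per-tensor int8 quantization (illustrative only).
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 using a single scale factor."""
    scale = np.abs(w).max() / 127.0      # largest magnitude maps to +/-127
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32 (1 byte vs 4 bytes per weight),
# and round-to-nearest bounds the per-weight error by half a quantization step.
print(w.nbytes // q.nbytes)   # 4
```

Going from 8-bit to 4-bit halves the storage again, which is where the 4-8x figure above comes from; the trade is a coarser grid and therefore larger rounding error per weight.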

Compression in practice

Most consumer-facing AI applications use compressed models. When you run an AI model on your phone or laptop, it has almost certainly been quantized and possibly pruned. Open-source communities routinely create 4-bit quantized versions of large models that run on consumer GPUs, making capable AI accessible without expensive hardware.

Trade-offs

Compression always involves trade-offs. Aggressive compression can degrade performance on complex reasoning tasks while maintaining quality on simpler ones. The right level of compression depends on your use case: a customer service chatbot may tolerate more compression than a medical diagnosis system.


Why This Matters

Model compression is what makes AI accessible and affordable. Understanding compression techniques helps you evaluate whether you need the largest model or whether a compressed version delivers sufficient quality at a fraction of the cost and infrastructure requirements.

Learn More

This topic is covered in our lesson: The Economics of AI