
Quantization

Last reviewed: April 2026

A technique that reduces the precision of an AI model's numerical weights to make it smaller, faster, and cheaper to run.

Quantization is a compression technique that reduces the size of an AI model by lowering the precision of the numbers it stores internally. Instead of using high-precision 32-bit floating point numbers for every weight, a quantized model might use 16-bit, 8-bit, or even 4-bit representations.

Why models need quantization

Large language models contain billions of parameters, each stored as a number. A 70-billion parameter model using 32-bit precision requires approximately 280 gigabytes of memory just to load, far more than most hardware can handle. Quantization shrinks that footprint dramatically. An 8-bit version of the same model needs roughly 70 gigabytes. A 4-bit version needs around 35 gigabytes.
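The arithmetic behind those figures is simple: bytes per parameter equals bits divided by eight. A minimal sketch (the function name is illustrative, not a real library API):

```python
def weight_memory_gb(params: float, bits: int) -> float:
    """Approximate memory to store model weights, in gigabytes.

    Bytes per parameter = bits / 8; 1 GB taken as 1e9 bytes.
    Ignores activations, KV cache, and runtime overhead.
    """
    return params * (bits / 8) / 1e9

params = 70e9  # a 70-billion parameter model
print(weight_memory_gb(params, 32))  # 280.0 GB at 32-bit
print(weight_memory_gb(params, 8))   # 70.0 GB at 8-bit
print(weight_memory_gb(params, 4))   # 35.0 GB at 4-bit
```

Note this counts weights only; real deployments also need memory for activations and the attention cache, so actual requirements run somewhat higher.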

How it works

Think of it like reducing the resolution of an image. A full-resolution photo captures every subtle shade of colour. A compressed version uses fewer colour values but still looks nearly identical to the human eye. Similarly, a quantized model represents each weight with fewer distinct values but produces nearly identical outputs.

The main approaches include:

  • Post-training quantization: Take a fully trained model and convert its weights to lower precision. This is the simplest method and requires no additional training.
  • Quantization-aware training: Train the model while simulating lower precision, so the model learns to perform well despite the reduced precision.
  • Mixed precision: Use lower precision for most layers but keep higher precision for layers where accuracy matters most.
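To make the first approach concrete, here is a minimal sketch of symmetric post-training quantization to 8-bit integers: each float weight is mapped into the range [-127, 127] using a single per-tensor scale, then approximately recovered by multiplying back. The function names are illustrative; real toolkits use more sophisticated schemes (per-channel scales, zero points, calibration data).

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: float32 -> int8 plus a scale."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Each weight now occupies 1 byte instead of 4, at the cost of a
# small reconstruction error bounded by half the scale.
print(np.abs(w - w_hat).max())
```

The design choice that matters here is the scale: it stretches the int8 grid to cover the largest weight, so a single outlier weight coarsens the grid for every other weight in the tensor. This is why practical schemes often quantize per channel or per block rather than per tensor.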

The quality trade-off

Quantization is not free. Reducing precision means some information is lost. The question is whether that loss matters in practice. Research consistently shows that 8-bit quantization produces negligible quality loss for most tasks. Even 4-bit quantization retains surprisingly strong performance, though very demanding tasks like complex reasoning or code generation may see minor degradation.
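The trade-off can be seen directly by measuring round-trip error at different bit widths. This toy experiment uses simple symmetric linear quantization as a stand-in for real methods, so the numbers are illustrative only:

```python
import numpy as np

def roundtrip_error(weights: np.ndarray, bits: int) -> float:
    """Mean absolute error after quantizing to `bits` and back."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit, 7 for 4-bit
    scale = float(np.abs(weights).max()) / levels
    q = np.round(weights / scale)
    return float(np.mean(np.abs(weights - q * scale)))

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000)

# Fewer bits -> coarser grid -> larger average error.
for bits in (8, 4):
    print(f"{bits}-bit mean error: {roundtrip_error(w, bits):.4f}")
```

Halving the bit width roughly squares the number of lost grid points: 8-bit offers 255 levels per tensor, 4-bit only 15, which is why very demanding tasks start to feel the loss at 4 bits.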

Why this matters for running AI locally

Quantization is what makes it possible to run powerful AI models on consumer hardware. Without it, running a large language model would require server-grade GPUs costing thousands of pounds. With aggressive quantization, many capable models can run on a laptop.

Business implications

For organisations deploying AI at scale, quantization directly reduces infrastructure costs. Serving a quantized model requires fewer GPUs, less memory, and less energy, which translates to lower cloud computing bills and faster response times.


Why this matters

Quantization is one of the most practical cost-reduction techniques in AI deployment. It allows organisations to run larger, more capable models on less expensive hardware without significant quality loss. Understanding quantization helps you evaluate AI infrastructure costs and make informed decisions about self-hosting versus cloud deployment.

Learn More

Continue learning in Advanced

This topic is covered in our lesson: AI Infrastructure and Deployment