Quantization
A technique that reduces the precision of an AI model's numerical weights to make it smaller, faster, and cheaper to run.
Quantization is a compression technique that reduces the size of an AI model by lowering the precision of the numbers it stores internally. Instead of using high-precision 32-bit floating point numbers for every weight, a quantized model might use 16-bit, 8-bit, or even 4-bit representations.
Why models need quantization
Large language models contain billions of parameters, each stored as a number. A 70-billion parameter model using 32-bit precision requires approximately 280 gigabytes of memory just to load, far more than most hardware can handle. Quantization shrinks that footprint dramatically. An 8-bit version of the same model needs roughly 70 gigabytes. A 4-bit version needs around 35 gigabytes.
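The arithmetic behind those figures is simple: parameters times bits per weight, divided by eight to get bytes. A minimal sketch (weights only; real deployments also need memory for activations and caches):

```python
# Back-of-envelope memory for model weights at different precisions.
# Weights only: activations, KV cache, and framework overhead add more.

def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9

params = 70e9  # a 70-billion parameter model

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(params, bits):,.0f} GB")
    # 32-bit: 280 GB, 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB
```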
How it works
Think of it like reducing the resolution of an image. A full-resolution photo captures every subtle shade of colour. A compressed version uses fewer colour values but still looks nearly identical to the human eye. Similarly, quantized models use fewer distinct numerical values for each weight but produce nearly identical outputs.
The main approaches include:
- Post-training quantization: Take a fully trained model and convert its weights to lower precision. This is the simplest method and requires no additional training.
- Quantization-aware training: Train the model while simulating lower precision, so the model learns to compensate for the reduced precision.
- Mixed precision: Use lower precision for most layers but keep higher precision for layers where accuracy matters most.
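To make the first approach concrete, here is a minimal sketch of symmetric post-training quantization to 8-bit integers. This is an illustration only, assuming a single scale factor for the whole tensor; production quantizers typically use per-channel scales, zero points, and outlier handling:

```python
# Sketch of symmetric post-training quantization: float32 -> int8 + scale.
# One scale per tensor, an assumption made for simplicity.
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights to int8 plus a single scale factor."""
    scale = float(np.abs(weights).max()) / 127.0  # largest magnitude -> 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("storage:", q.nbytes, "bytes (int8) vs", w.nbytes, "bytes (float32)")
print("max reconstruction error:", np.abs(w - w_hat).max())
```

Note that the model stores only the int8 array and the scale; the float values are reconstructed on the fly during inference, which is where the memory saving comes from.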
The quality trade-off
Quantization is not free. Reducing precision means some information is lost. The question is whether that loss matters in practice. Research consistently shows that 8-bit quantization produces negligible quality loss for most tasks. Even 4-bit quantization retains surprisingly strong performance, though very demanding tasks like complex reasoning or code generation may see minor degradation.
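The precision-versus-error trade-off can be demonstrated directly: round-trip the same weights through progressively narrower integer formats and measure how far the reconstruction drifts. A sketch using uniform symmetric quantization on synthetic Gaussian "weights" (real LLM quantizers use smarter grouped schemes, so treat the numbers as illustrative):

```python
# Reconstruction error grows as bit width shrinks.
# Uniform symmetric quantization over synthetic Gaussian weights.
import numpy as np

def quantize_error(weights: np.ndarray, bits: int) -> float:
    """Mean absolute error after round-tripping through bits-wide ints."""
    levels = 2 ** (bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit
    scale = float(np.abs(weights).max()) / levels
    q = np.clip(np.round(weights / scale), -levels, levels)
    return float(np.abs(weights - q * scale).mean())

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit mean abs error: {quantize_error(w, bits):.6f}")
```

With only 7 positive levels at 4-bit versus 127 at 8-bit, each quantization step is roughly sixteen times coarser, which is why aggressive quantization is where quality loss starts to show.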
Why this matters for running AI locally
Quantization is what makes it possible to run powerful AI models on consumer hardware. Without it, running a large language model would require server-grade GPUs costing thousands of pounds. With aggressive quantization, many capable models can run on a laptop.
Business implications
For organisations deploying AI at scale, quantization directly reduces infrastructure costs. Serving a quantized model requires fewer GPUs, less memory, and less energy, translating to lower cloud computing bills and faster response times.
Why This Matters
Quantization is one of the most practical cost-reduction techniques in AI deployment. It allows organisations to run larger, more capable models on less expensive hardware without significant quality loss. Understanding quantization helps you evaluate AI infrastructure costs and make informed decisions about self-hosting versus cloud deployment.
Continue learning in Advanced
This topic is covered in our lesson: AI Infrastructure and Deployment