Sparse Model
An AI model where only a fraction of parameters activate for each input, making it more efficient than dense models that use all parameters every time.
A sparse model is an AI model where only a subset of its total parameters are activated for any given input. Unlike dense models, which use every parameter for every computation, sparse models selectively activate different parts of the network depending on the input.
Dense vs sparse
A dense model with 100 billion parameters uses all 100 billion for every single token it processes. A sparse model with 100 billion total parameters might only use 10-20 billion per token. The result: you get the knowledge capacity of a 100-billion-parameter model at the computational cost of a much smaller one.
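The cost gap above can be made concrete with back-of-envelope arithmetic. The figures below are illustrative assumptions, not measurements of any specific model, and the common rule of thumb of roughly 2 FLOPs per active parameter per token is itself an approximation:

```python
# Illustrative back-of-envelope arithmetic (assumed figures, not from any
# specific model): roughly 2 FLOPs per active parameter per generated token.
FLOPS_PER_PARAM = 2

dense_params = 100e9    # dense model: every parameter is used for every token
sparse_total = 100e9    # sparse model: total knowledge capacity
sparse_active = 15e9    # sparse model: parameters actually active per token

dense_flops = FLOPS_PER_PARAM * dense_params
sparse_flops = FLOPS_PER_PARAM * sparse_active

print(f"Dense per-token FLOPs:  {dense_flops:.1e}")
print(f"Sparse per-token FLOPs: {sparse_flops:.1e}")
print(f"Compute reduction: {dense_flops / sparse_flops:.1f}x")
```

With these assumed numbers, the sparse model does under a seventh of the per-token computation while retaining the same total parameter count.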
How sparsity works in practice
The most common form of sparsity in modern AI is the mixture-of-experts (MoE) architecture. In an MoE model:
- The model contains multiple expert sub-networks.
- A routing mechanism selects which experts to activate for each input.
- Only the selected experts compute; the rest remain dormant.
- The active experts' outputs are combined to produce the final result.
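The routing steps above can be sketched in a few lines. This is a minimal illustration of top-k expert routing for a single token, not production MoE code; the expert count, dimensions, and random weights are all invented for the example:

```python
import numpy as np

# Minimal sketch of top-k mixture-of-experts routing (illustrative only).
rng = np.random.default_rng(0)

n_experts, d_model, top_k = 8, 16, 2
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route one token vector x through only its top-k experts."""
    logits = x @ router_w                 # router score for each expert
    top = np.argsort(logits)[-top_k:]     # pick the k highest-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts only
    # Only the selected experts compute; the other n_experts - top_k stay dormant.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
out = moe_layer(token)
print(out.shape)  # (16,)
```

Note that only 2 of the 8 expert matrices participate in the matrix multiplications for this token; which 2 changes from token to token, which is where the specialisation comes from.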
Other forms of sparsity include:
- Weight pruning: Removing connections (setting weights to zero) that contribute little to the model's output.
- Activation sparsity: Many neurons naturally produce zero outputs for a given input. Sparse implementations skip the computation for these zeros.
- Structured sparsity: Removing entire blocks of neurons or attention heads rather than individual connections.
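Of these, weight pruning is the easiest to show directly. Below is a minimal sketch of magnitude-based pruning, the simplest common criterion, zeroing the smallest-magnitude weights; the matrix size and sparsity level are arbitrary example values:

```python
import numpy as np

# Minimal sketch of magnitude-based weight pruning (illustrative only).
rng = np.random.default_rng(1)
weights = rng.standard_normal((64, 64))

def prune_by_magnitude(w, sparsity=0.9):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold   # keep only the largest-magnitude weights
    return w * mask

pruned = prune_by_magnitude(weights, sparsity=0.9)
frac_zero = (pruned == 0).mean()
print(f"Fraction of zeroed weights: {frac_zero:.2f}")  # ≈ 0.90
```

In practice, pruned models are usually fine-tuned afterwards to recover accuracy, and the zeros only save compute if the hardware or kernel can actually skip them.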
Benefits of sparsity
- Faster inference: Processing fewer parameters per input means faster response times.
- Lower cost: Less computation per query translates directly to lower serving costs.
- Greater capacity: A sparse model can store more knowledge in its total parameters while remaining affordable to run.
- Specialisation: Different parts of the model can specialise in different types of inputs.
Real-world examples
Mixtral 8x7B is a well-known sparse model with approximately 47 billion total parameters but only about 13 billion active per token. GPT-4 is widely believed to use a sparse MoE architecture, though OpenAI has not confirmed the details.
Trade-offs
- Memory: Even though only some parameters are active, all must be loaded into memory.
- Training complexity: Ensuring balanced utilisation of all sparse components requires careful engineering.
- Quantisation interaction: Compression techniques such as quantisation can behave differently on sparse models than on dense ones; for instance, rarely activated experts see less data during calibration.
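The memory trade-off is worth quantifying. Using the widely reported (approximate) Mixtral 8x7B figures of ~47 billion total and ~13 billion active parameters, and assuming 16-bit weights at 2 bytes each:

```python
# Illustrative memory arithmetic for the "all parameters must be loaded" point.
# Mixtral 8x7B figures are approximate: ~47B total, ~13B active per token.
BYTES_FP16 = 2

total_params = 47e9
active_params = 13e9

weights_gb = total_params * BYTES_FP16 / 1e9
print(f"Weights in memory (fp16): ~{weights_gb:.0f} GB")  # ~94 GB
print(f"Computed per token: ~{active_params / 1e9:.0f}B of {total_params / 1e9:.0f}B params")
```

So while per-token compute scales with the ~13B active parameters, the serving hardware must still hold all ~94 GB of weights, which is why sparse models reduce compute costs far more than they reduce memory requirements.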
Why This Matters
Sparse models explain how AI companies deliver high-quality results at manageable costs. Understanding sparsity helps you evaluate model size claims (total parameters vs active parameters) and make informed decisions about performance expectations and deployment requirements.