Sparse Model
An AI model where only a fraction of parameters activate for each input, making it more efficient than dense models that use all parameters every time.
A sparse model is an AI model where only a subset of its total parameters are activated for any given input. Unlike dense models, which use every parameter for every computation, sparse models selectively activate different parts of the network depending on the input.
Dense vs sparse
A dense model with 100 billion parameters uses all 100 billion for every single token it processes. A sparse model with 100 billion total parameters might only use 10-20 billion per token. The result: you get the knowledge capacity of a 100-billion-parameter model at the computational cost of a much smaller one.
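The cost gap above can be made concrete with back-of-envelope arithmetic. The figures below are illustrative assumptions, not measurements of any specific model, and the common rule of thumb of roughly 2 FLOPs per active parameter per token is itself an approximation:

```python
# Illustrative back-of-envelope arithmetic (assumed figures, not from any
# specific model): roughly 2 FLOPs per active parameter per generated token.
FLOPS_PER_PARAM = 2

dense_params = 100e9    # dense model: every parameter is used for every token
sparse_total = 100e9    # sparse model: total knowledge capacity
sparse_active = 15e9    # sparse model: parameters actually active per token

dense_flops = FLOPS_PER_PARAM * dense_params
sparse_flops = FLOPS_PER_PARAM * sparse_active

print(f"Dense per-token FLOPs:  {dense_flops:.1e}")
print(f"Sparse per-token FLOPs: {sparse_flops:.1e}")
print(f"Compute reduction: {dense_flops / sparse_flops:.1f}x")
```

With these assumed numbers, the sparse model does under a seventh of the per-token computation while retaining the same total parameter count.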
How sparsity works in practice
The most common form of sparsity in modern AI is the mixture-of-experts (MoE) architecture. In an MoE model:
- The model contains multiple expert sub-networks.
- A routing mechanism selects which experts to activate for each input.
- Only the selected experts compute; the rest remain dormant.
- The active experts' outputs are combined to produce the final result.
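The routing steps above can be sketched in a few lines. This is a minimal illustration of top-k expert routing for a single token, not production MoE code; the expert count, dimensions, and random weights are all invented for the example:

```python
import numpy as np

# Minimal sketch of top-k mixture-of-experts routing (illustrative only).
rng = np.random.default_rng(0)

n_experts, d_model, top_k = 8, 16, 2
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route one token vector x through only its top-k experts."""
    logits = x @ router_w                 # router score for each expert
    top = np.argsort(logits)[-top_k:]     # pick the k highest-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts only
    # Only the selected experts compute; the other n_experts - top_k stay dormant.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
out = moe_layer(token)
print(out.shape)  # (16,)
```

Note that only 2 of the 8 expert matrices participate in the matrix multiplications for this token; which 2 changes from token to token, which is where the specialisation comes from.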
Other forms of sparsity include:
- Weight pruning: Removing connections (setting weights to zero) that contribute little to the model's output.
- Activation sparsity: Many neurons naturally produce zero outputs for a given input. Sparse implementations skip the computation for these zeros.
- Structured sparsity: Removing entire blocks of neurons or attention heads rather than individual connections.
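Of these, weight pruning is the easiest to show directly. Below is a minimal sketch of magnitude-based pruning, the simplest common criterion, zeroing the smallest-magnitude weights; the matrix size and sparsity level are arbitrary example values:

```python
import numpy as np

# Minimal sketch of magnitude-based weight pruning (illustrative only).
rng = np.random.default_rng(1)
weights = rng.standard_normal((64, 64))

def prune_by_magnitude(w, sparsity=0.9):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold   # keep only the largest-magnitude weights
    return w * mask

pruned = prune_by_magnitude(weights, sparsity=0.9)
frac_zero = (pruned == 0).mean()
print(f"Fraction of zeroed weights: {frac_zero:.2f}")  # ≈ 0.90
```

In practice, pruned models are usually fine-tuned afterwards to recover accuracy, and the zeros only save compute if the hardware or kernel can actually skip them.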
Benefits of sparsity
- Faster inference: Processing fewer parameters per input means faster response times.
- Lower cost: Less computation per query translates directly to lower serving costs.
- Greater capacity: A sparse model can store more knowledge in its total parameters while remaining affordable to run.
- Specialisation: Different parts of the model can specialise in different types of inputs.
Real-world examples
Mixtral 8x7B is a well-known sparse model with approximately 47 billion total parameters but only about 13 billion active per token. GPT-4 is widely believed to use a sparse MoE architecture, though OpenAI has not confirmed the details.
Trade-offs
- Memory: Even though only some parameters are active, all must be loaded into memory.
- Training complexity: Ensuring balanced utilisation of all sparse components requires careful engineering.
- Quantisation interaction: Compression techniques such as quantisation can behave differently on sparse models than on dense ones; for instance, rarely activated experts see less data during calibration.
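The memory trade-off is worth quantifying. Using the widely reported (approximate) Mixtral 8x7B figures of ~47 billion total and ~13 billion active parameters, and assuming 16-bit weights at 2 bytes each:

```python
# Illustrative memory arithmetic for the "all parameters must be loaded" point.
# Mixtral 8x7B figures are approximate: ~47B total, ~13B active per token.
BYTES_FP16 = 2

total_params = 47e9
active_params = 13e9

weights_gb = total_params * BYTES_FP16 / 1e9
print(f"Weights in memory (fp16): ~{weights_gb:.0f} GB")  # ~94 GB
print(f"Computed per token: ~{active_params / 1e9:.0f}B of {total_params / 1e9:.0f}B params")
```

So while per-token compute scales with the ~13B active parameters, the serving hardware must still hold all ~94 GB of weights, which is why sparse models reduce compute costs far more than they reduce memory requirements.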
Why This Matters
Sparse models explain how AI companies deliver high-quality results at manageable costs. Understanding sparsity helps you evaluate model size claims (total parameters vs active parameters) and make informed decisions about performance expectations and deployment requirements.