
Mixture of Experts (MoE)

Last reviewed: April 2026

A model architecture where only a subset of the model's parameters activate for each input, making large models faster and cheaper to run.

Mixture of experts (MoE) is a neural network architecture where the model is divided into multiple specialised sub-networks called experts. For any given input, only a small number of these experts are activated, meaning the model can be very large in total parameters but fast in practice because it only uses a fraction of them at a time.

The core idea

Imagine a hospital with fifty specialist doctors. When a patient arrives, a triage nurse does not send them to all fifty; they route the patient to the two or three most relevant specialists. MoE works the same way. A routing mechanism (the "gating network") examines each input and selects the most relevant experts to process it.

Why MoE matters for performance

A traditional dense model activates all its parameters for every input. A 100-billion-parameter dense model processes 100 billion parameters per request. An MoE model with 100 billion total parameters might only activate 10-20 billion per request, dramatically reducing computation while maintaining quality.
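The arithmetic above can be made concrete with a short Python sketch. The expert count and top-k values here are hypothetical, chosen so the active-parameter figure lands in the 10-20 billion range mentioned in the text.

```python
# Hypothetical configuration: 100B total parameters split across 16
# experts, with 2 experts active per token (all numbers illustrative).
total_params = 100e9
num_experts = 16
top_k = 2

# A dense model touches every parameter for every input.
dense_active = total_params

# An MoE model only touches the selected experts' share of parameters
# (ignoring shared layers such as attention, for simplicity).
moe_active = total_params * top_k / num_experts

print(f"dense: {dense_active / 1e9:.1f}B active params per token")
print(f"moe:   {moe_active / 1e9:.1f}B active params per token")
```

Real MoE models also contain shared (non-expert) layers that every token passes through, so the true active fraction is somewhat higher than this back-of-envelope figure.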

This means MoE models can be much larger in total capacity without proportionally increasing the cost of each query. Mixtral is openly an MoE model, and GPT-4 is widely believed to use a similar architecture.

How routing works

The gating network is a small neural network that takes the input and outputs a probability distribution over the available experts. The top-k experts (often just one or two) are selected, and their outputs are combined, weighted by the routing probabilities. The gating network is trained alongside the experts, learning which experts are best for which types of input.
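Top-k routing for a single token can be sketched in a few lines of NumPy. Here the gating network is a single weight matrix and the experts are random linear maps; all names, sizes, and values are illustrative, not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, num_experts, top_k = 16, 8, 2
x = rng.normal(size=d_model)                      # one token's hidden state
W_gate = rng.normal(size=(num_experts, d_model))  # gating network weights

logits = W_gate @ x                      # score every expert for this token
top = np.argsort(logits)[-top_k:]        # indices of the top-k experts
weights = np.exp(logits[top])
weights /= weights.sum()                 # softmax over the selected experts only

# Only the selected experts process x; their outputs are mixed by weight.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
output = sum(w * (experts[i] @ x) for w, i in zip(weights, top))
```

Note that the unselected experts do no work at all, which is where the compute saving comes from; in a real model the gradient flows through the routing weights so the gate learns jointly with the experts.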

Advantages

  • Better performance per compute dollar: You get the quality benefits of a large model at the inference cost of a smaller one.
  • Specialisation: Different experts naturally develop expertise in different areas: one might excel at code, another at creative writing, another at reasoning.
  • Scalability: You can increase model capacity by adding more experts without proportionally increasing serving costs.

Challenges

  • Training complexity: Training must keep all experts in use, avoiding "expert collapse", where the router repeatedly picks the same few experts and the rest never learn.
  • Memory requirements: Even though only some experts are active per input, every expert must be held in memory, so the model's full parameter count still determines its footprint.
  • Load balancing: In distributed systems, ensuring even utilisation of hardware resources.
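One common mitigation for expert collapse and uneven load is an auxiliary load-balancing loss added to the training objective. The sketch below follows the style of the Switch Transformer formulation; all shapes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
num_tokens, num_experts = 64, 8

# Router softmax over experts for each token in a batch.
logits = rng.normal(size=(num_tokens, num_experts))
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

# Fraction of tokens routed to each expert (top-1 routing here),
# and the mean router probability assigned to each expert.
assignments = probs.argmax(axis=1)
frac_tokens = np.bincount(assignments, minlength=num_experts) / num_tokens
frac_probs = probs.mean(axis=0)

# The loss is minimised when both fractions are uniform (1 / num_experts),
# i.e. when tokens are spread evenly across experts.
aux_loss = num_experts * np.dot(frac_tokens, frac_probs)
print(f"aux loss: {aux_loss:.3f}  (1.0 = perfectly balanced)")
```

During training this term is scaled by a small coefficient and added to the main language-modelling loss, nudging the router toward even utilisation without dictating which expert handles which input.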

Why This Matters

MoE architecture explains why some AI models seem disproportionately capable relative to their speed and cost. Understanding this helps you evaluate model pricing and performance claims, and appreciate why model size alone is not a reliable indicator of quality or cost.
