
Mixture of Experts (MoE)

Last reviewed: April 2026

A model architecture where only a subset of the model's parameters activate for each input, making large models faster and cheaper to run.

Mixture of experts (MoE) is a neural network architecture where the model is divided into multiple specialised sub-networks called experts. For any given input, only a small number of these experts are activated, meaning the model can be very large in total parameters but fast in practice because it only uses a fraction of them at a time.

The core idea

Imagine a hospital with fifty specialist doctors. When a patient arrives, a triage nurse does not send them to all fifty; they route the patient to the two or three most relevant specialists. MoE works the same way. A routing mechanism (the "gating network") examines each input and selects the most relevant experts to process it.

Why MoE matters for performance

A traditional dense model activates all its parameters for every input. A 100-billion-parameter dense model processes 100 billion parameters per request. An MoE model with 100 billion total parameters might only activate 10-20 billion per request, dramatically reducing computation while maintaining quality.
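The arithmetic above can be made concrete with a short Python sketch. The expert count and top-k values here are hypothetical, chosen so the active-parameter figure lands in the 10-20 billion range mentioned in the text.

```python
# Hypothetical configuration: 100B total parameters split across 16
# experts, with 2 experts active per token (all numbers illustrative).
total_params = 100e9
num_experts = 16
top_k = 2

# A dense model touches every parameter for every input.
dense_active = total_params

# An MoE model only touches the selected experts' share of parameters
# (ignoring shared layers such as attention, for simplicity).
moe_active = total_params * top_k / num_experts

print(f"dense: {dense_active / 1e9:.1f}B active params per token")
print(f"moe:   {moe_active / 1e9:.1f}B active params per token")
```

Real MoE models also contain shared (non-expert) layers that every token passes through, so the true active fraction is somewhat higher than this back-of-envelope figure.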

This means MoE models can be much larger in total capacity without proportionally increasing the cost of each query. Mixtral is openly an MoE model, and GPT-4 is widely believed to use a similar architecture.

How routing works

The gating network is a small neural network that takes the input and outputs a probability distribution over the available experts. The top-k experts (often just one or two) are selected, and their outputs are combined, weighted by the routing probabilities. The gating network is trained alongside the experts, learning which experts are best for which types of input.
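Top-k routing for a single token can be sketched in a few lines of NumPy. Here the gating network is a single weight matrix and the experts are random linear maps; all names, sizes, and values are illustrative, not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, num_experts, top_k = 16, 8, 2
x = rng.normal(size=d_model)                      # one token's hidden state
W_gate = rng.normal(size=(num_experts, d_model))  # gating network weights

logits = W_gate @ x                      # score every expert for this token
top = np.argsort(logits)[-top_k:]        # indices of the top-k experts
weights = np.exp(logits[top])
weights /= weights.sum()                 # softmax over the selected experts only

# Only the selected experts process x; their outputs are mixed by weight.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
output = sum(w * (experts[i] @ x) for w, i in zip(weights, top))
```

Note that the unselected experts do no work at all, which is where the compute saving comes from; in a real model the gradient flows through the routing weights so the gate learns jointly with the experts.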

Advantages

  • Better performance per compute dollar: You get the quality benefits of a large model at the inference cost of a smaller one.
  • Specialisation: Different experts naturally develop expertise in different areas: one might excel at code, another at creative writing, another at reasoning.
  • Scalability: You can increase model capacity by adding more experts without proportionally increasing serving costs.

Challenges

  • Training complexity: Training must keep all experts in use, avoiding "expert collapse", where the router repeatedly picks the same few experts and the rest never learn.
  • Memory requirements: Even though only some experts are active per input, every expert must be held in memory, so the model's full parameter count still determines its footprint.
  • Load balancing: In distributed systems, ensuring even utilisation of hardware resources.
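One common mitigation for expert collapse and uneven load is an auxiliary load-balancing loss added to the training objective. The sketch below follows the style of the Switch Transformer formulation; all shapes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
num_tokens, num_experts = 64, 8

# Router softmax over experts for each token in a batch.
logits = rng.normal(size=(num_tokens, num_experts))
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

# Fraction of tokens routed to each expert (top-1 routing here),
# and the mean router probability assigned to each expert.
assignments = probs.argmax(axis=1)
frac_tokens = np.bincount(assignments, minlength=num_experts) / num_tokens
frac_probs = probs.mean(axis=0)

# The loss is minimised when both fractions are uniform (1 / num_experts),
# i.e. when tokens are spread evenly across experts.
aux_loss = num_experts * np.dot(frac_tokens, frac_probs)
print(f"aux loss: {aux_loss:.3f}  (1.0 = perfectly balanced)")
```

During training this term is scaled by a small coefficient and added to the main language-modelling loss, nudging the router toward even utilisation without dictating which expert handles which input.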

Why This Matters

MoE architecture explains why some AI models seem disproportionately capable relative to their speed and cost. Understanding this helps you evaluate model pricing and performance claims, and appreciate why model size alone is not a reliable indicator of quality or cost.
