Small Language Model (SLM)
A language model with fewer parameters (typically under 10 billion) that trades some capability for dramatically lower cost, faster speed, and the ability to run on smaller hardware.
A small language model (SLM) is a language model with relatively few parameters, typically under 10 billion, compared to the hundreds of billions in frontier models. SLMs trade some general capability for dramatically lower cost, faster inference, and the ability to run on consumer hardware.
Why small models matter
Not every task needs the most powerful model. Classifying a support ticket, extracting a date from an email, or generating a one-sentence summary are simple tasks that a small model handles well. Using a 400-billion-parameter model for these tasks is like hiring a brain surgeon to apply a plaster.
The SLM landscape
- Phi (Microsoft): 1-4 billion parameters, surprisingly capable for their size.
- Gemma (Google): 2-9 billion parameters, optimised for on-device use.
- Llama 3.2 (Meta): 1-3 billion parameter variants designed for mobile and edge.
- Qwen 2.5 (Alibaba): Multiple small variants for diverse tasks.
- Mistral 7B: A well-regarded 7-billion-parameter model.
When to use small models
- High-volume, simple tasks: Classification, extraction, routing, and formatting where cost at scale matters.
- On-device deployment: Running AI on phones, laptops, or IoT devices where large models cannot fit.
- Low-latency requirements: When response time is critical and every millisecond counts.
- Privacy-sensitive applications: Running locally means data never leaves the device.
- Cost optimisation: A 3-billion-parameter model can be 100x cheaper per query than a frontier model.
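The cost argument above can be made concrete with a quick back-of-the-envelope calculation. The prices and the 100x ratio below are illustrative placeholders, not real vendor pricing:

```python
# Illustrative cost comparison between a small and a frontier model.
# Both per-token prices are hypothetical, chosen to reflect the ~100x
# gap the text describes, not any specific provider's pricing.

SMALL_MODEL_COST_PER_1M_TOKENS = 0.10     # hypothetical: ~3B-parameter model
FRONTIER_MODEL_COST_PER_1M_TOKENS = 10.00 # hypothetical: frontier model

def monthly_cost(queries_per_day: int, tokens_per_query: int,
                 cost_per_1m_tokens: float) -> float:
    """Estimated monthly spend for a given query volume."""
    tokens_per_month = queries_per_day * 30 * tokens_per_query
    return tokens_per_month / 1_000_000 * cost_per_1m_tokens

small = monthly_cost(100_000, 500, SMALL_MODEL_COST_PER_1M_TOKENS)
frontier = monthly_cost(100_000, 500, FRONTIER_MODEL_COST_PER_1M_TOKENS)
print(f"Small model:    ${small:,.2f}/month")     # $150.00/month
print(f"Frontier model: ${frontier:,.2f}/month")  # $15,000.00/month
```

At 100,000 queries a day, the difference is the gap between a rounding error and a serious line item, which is why high-volume tasks are the canonical SLM use case.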
The quality question
Small models are remarkably capable on focused tasks, especially after fine-tuning. A 7B model fine-tuned for a specific task often matches or exceeds a general-purpose frontier model on that task. The key is matching model size to task complexity.
Where small models clearly fall short:
- Complex multi-step reasoning.
- Tasks requiring broad world knowledge.
- Creative writing requiring nuance and originality.
- Following long, complex instructions with many constraints.
The model routing pattern
Many production systems use a router that sends simple queries to small, cheap models and complex queries to large, capable models. This optimises both cost and quality: each query is served by the cheapest model that can handle it well.
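The routing pattern can be sketched in a few lines. The model names and the keyword-based complexity heuristic below are hypothetical; production routers typically use a small classifier model rather than keyword rules:

```python
# A minimal sketch of the model-routing pattern. Model identifiers and
# the complexity heuristic are illustrative assumptions, not real APIs.

SMALL_MODEL = "small-3b"       # hypothetical cheap model
LARGE_MODEL = "frontier-400b"  # hypothetical capable model

# Crude signals that a query needs multi-step reasoning or generation.
COMPLEX_SIGNALS = ("explain why", "step by step", "compare", "analyse")

def route(query: str) -> str:
    """Pick a model based on a rough complexity estimate of the query."""
    q = query.lower()
    looks_complex = len(q.split()) > 40 or any(s in q for s in COMPLEX_SIGNALS)
    return LARGE_MODEL if looks_complex else SMALL_MODEL

print(route("What category is this ticket: 'My invoice is wrong'?"))
# -> small-3b
print(route("Compare these two designs step by step and explain why."))
# -> frontier-400b
```

A fallback path is also common: if the small model's answer fails a confidence or validation check, the query is retried on the large model, so routing mistakes cost latency rather than quality.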
Why This Matters
Small language models make AI economically viable at scale. Understanding when a small model suffices versus when you need a frontier model is one of the most impactful cost decisions in AI deployment. The right model for the job is often not the biggest one.