Small Language Model (SLM)
A language model with fewer parameters (typically under 10 billion) that trades some capability for dramatically lower cost, faster speed, and the ability to run on smaller hardware.
A small language model (SLM) is a language model with relatively few parameters, typically under 10 billion, compared to the hundreds of billions in frontier models. SLMs trade some general capability for dramatically lower cost, faster inference, and the ability to run on consumer hardware.
Why small models matter
Not every task needs the most powerful model. Classifying a support ticket, extracting a date from an email, or generating a one-sentence summary are simple tasks that a small model handles well. Using a 400-billion-parameter model for these tasks is like hiring a brain surgeon to apply a plaster.
The SLM landscape
- Phi (Microsoft): 1-4 billion parameters, surprisingly capable for their size.
- Gemma (Google): 2-9 billion parameters, optimised for on-device use.
- Llama 3.2 (Meta): 1-3 billion parameter variants designed for mobile and edge.
- Qwen 2.5 (Alibaba): Multiple small variants for diverse tasks.
- Mistral 7B: A well-regarded 7-billion-parameter model.
When to use small models
- High-volume, simple tasks: Classification, extraction, routing, and formatting where cost at scale matters.
- On-device deployment: Running AI on phones, laptops, or IoT devices where large models cannot fit.
- Low-latency requirements: When response time is critical and every millisecond counts.
- Privacy-sensitive applications: Running locally means data never leaves the device.
- Cost optimisation: A 3-billion-parameter model can be 100x cheaper per query than a frontier model.
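The cost argument above can be made concrete with a quick back-of-the-envelope calculation. The prices and the 100x ratio below are illustrative placeholders, not real vendor pricing:

```python
# Illustrative cost comparison between a small and a frontier model.
# Both per-token prices are hypothetical, chosen to reflect the ~100x
# gap the text describes, not any specific provider's pricing.

SMALL_MODEL_COST_PER_1M_TOKENS = 0.10     # hypothetical: ~3B-parameter model
FRONTIER_MODEL_COST_PER_1M_TOKENS = 10.00 # hypothetical: frontier model

def monthly_cost(queries_per_day: int, tokens_per_query: int,
                 cost_per_1m_tokens: float) -> float:
    """Estimated monthly spend for a given query volume."""
    tokens_per_month = queries_per_day * 30 * tokens_per_query
    return tokens_per_month / 1_000_000 * cost_per_1m_tokens

small = monthly_cost(100_000, 500, SMALL_MODEL_COST_PER_1M_TOKENS)
frontier = monthly_cost(100_000, 500, FRONTIER_MODEL_COST_PER_1M_TOKENS)
print(f"Small model:    ${small:,.2f}/month")     # $150.00/month
print(f"Frontier model: ${frontier:,.2f}/month")  # $15,000.00/month
```

At 100,000 queries a day, the difference is the gap between a rounding error and a serious line item, which is why high-volume tasks are the canonical SLM use case.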
The quality question
Small models are remarkably capable on focused tasks, especially after fine-tuning. A 7B model fine-tuned for a specific task often matches or exceeds a general-purpose frontier model on that task. The key is matching model size to task complexity.
Where small models clearly fall short:
- Complex multi-step reasoning.
- Tasks requiring broad world knowledge.
- Creative writing requiring nuance and originality.
- Following long, complex instructions with many constraints.
The model routing pattern
Many production systems use a router that sends simple queries to small, cheap models and complex queries to large, capable models. This optimises both cost and quality: each query is served by the cheapest model that can handle it well.
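The routing pattern can be sketched in a few lines. The model names and the keyword-based complexity heuristic below are hypothetical; production routers typically use a small classifier model rather than keyword rules:

```python
# A minimal sketch of the model-routing pattern. Model identifiers and
# the complexity heuristic are illustrative assumptions, not real APIs.

SMALL_MODEL = "small-3b"       # hypothetical cheap model
LARGE_MODEL = "frontier-400b"  # hypothetical capable model

# Crude signals that a query needs multi-step reasoning or generation.
COMPLEX_SIGNALS = ("explain why", "step by step", "compare", "analyse")

def route(query: str) -> str:
    """Pick a model based on a rough complexity estimate of the query."""
    q = query.lower()
    looks_complex = len(q.split()) > 40 or any(s in q for s in COMPLEX_SIGNALS)
    return LARGE_MODEL if looks_complex else SMALL_MODEL

print(route("What category is this ticket: 'My invoice is wrong'?"))
# -> small-3b
print(route("Compare these two designs step by step and explain why."))
# -> frontier-400b
```

A fallback path is also common: if the small model's answer fails a confidence or validation check, the query is retried on the large model, so routing mistakes cost latency rather than quality.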
Why This Matters
Small language models make AI economically viable at scale. Understanding when a small model suffices versus when you need a frontier model is one of the most impactful cost decisions in AI deployment. The right model for the job is often not the biggest one.