Core AI

Benchmark

Last reviewed: April 2026

A standardised test or dataset used to measure and compare the performance of different AI models on specific tasks.

A benchmark is a standardised test that lets researchers and practitioners compare AI models on a level playing field. Just as school exams test students on the same material, benchmarks test models on the same tasks with the same data.

Why benchmarks exist

Without benchmarks, model comparisons would be unreliable, because every company could cherry-pick examples where its model excels. Benchmarks provide a common yardstick: a shared dataset and evaluation method that every model is scored against.
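As a rough illustration of what a shared dataset and evaluation method means in practice, the Python sketch below scores two stand-in models on the same three questions using exact-match accuracy. The questions, the evaluate helper, and both toy models are hypothetical and are not taken from any real benchmark.

```python
from typing import Callable

# Hypothetical three-item benchmark: every model sees the same questions
# and is graded against the same reference answers.
BENCHMARK = [
    {"question": "What is 7 + 5?", "answer": "12"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many days are in a week?", "answer": "7"},
]


def evaluate(model: Callable[[str], str], items: list[dict]) -> float:
    """Return the exact-match accuracy of `model` over `items`."""
    correct = sum(
        model(item["question"]).strip().lower() == item["answer"].strip().lower()
        for item in items
    )
    return correct / len(items)


# Toy stand-ins for real models: each maps a question to an answer string.
def model_a(question: str) -> str:
    answers = {
        "What is 7 + 5?": "12",
        "What is the capital of France?": "Paris",
        "How many days are in a week?": "7",
    }
    return answers.get(question, "")


def model_b(question: str) -> str:
    answers = {
        "What is 7 + 5?": "12",
        "What is the capital of France?": "Lyon",  # deliberate mistake
        "How many days are in a week?": "7",
    }
    return answers.get(question, "")


# Same data, same scoring rule, so the two numbers are directly comparable.
scores = {
    "model_a": evaluate(model_a, BENCHMARK),
    "model_b": evaluate(model_b, BENCHMARK),
}
```

Because both models face identical items and an identical scoring rule, the gap between their scores reflects the models and not the choice of examples. Real benchmarks use far larger datasets and often more forgiving scoring than exact string matching.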

Common AI benchmarks

MMLU (Massive Multitask Language Understanding) tests knowledge across fifty-seven academic subjects from mathematics to law
HumanEval measures code generation ability by checking whether model-written Python functions pass a set of unit tests
HellaSwag tests commonsense reasoning by asking models to pick the most plausible continuation of an everyday situation
GSM8K evaluates mathematical reasoning with grade-school maths word problems
MT-Bench assesses conversational ability through multi-turn dialogues judged by another AI

The limitations of benchmarks

Benchmarks have significant shortcomings:

Teaching to the test — models can be optimised specifically for benchmark performance without genuine improvement on real tasks
Data contamination — benchmark questions can leak into training data, inflating scores artificially
Narrow measurement — a model that scores well on academic benchmarks may perform poorly on practical business tasks
Outdated standards — as models improve, older benchmarks become saturated: top models all score near the maximum, so the benchmark can no longer tell them apart
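To make the data contamination point concrete, one simple heuristic is to measure how many word n-grams from a benchmark question also appear in a training corpus. The sketch below is a minimal version of that idea. The function names, the window size of five words, and all sample text are assumptions for illustration. Real contamination audits involve text normalisation, longer windows, and far larger corpora.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item: str, training_text: str, n: int = 5) -> float:
    """Fraction of the item's n-grams that also occur in the training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


# Invented sample data, for illustration only.
training_text = (
    "quiz archive: a train leaves the station at 9 am travelling "
    "at 60 km per hour how far does it go"
)
leaked_item = "A train leaves the station at 9 am travelling at 60 km per hour"
fresh_item = "A baker makes 12 loaves each morning and sells 9 of them"

leaked_score = overlap_fraction(leaked_item, training_text)  # high overlap
fresh_score = overlap_fraction(fresh_item, training_text)    # no overlap
```

A high overlap fraction suggests the model may have seen the question during training, in which case a correct answer says little about its reasoning ability.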

How to use benchmarks wisely

Treat benchmarks as a rough filter, not a definitive ranking. Use them to narrow your shortlist, then test the shortlisted models on your own data and use cases. A model that scores three points lower on MMLU but handles your specific document types better is usually the better choice for your organisation.
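The two-step approach can be sketched in a few lines of Python. All model names, scores, and the shortlist threshold below are made up for illustration. In practice the in-house scores would come from running each shortlisted model on your own test documents.

```python
# Hypothetical public benchmark scores (percent) for three made-up models.
public_scores = {"model_x": 86.0, "model_y": 83.0, "model_z": 61.0}

# Hypothetical results from your own test set: the fraction of your
# documents that each model handled correctly.
in_house_scores = {"model_x": 0.72, "model_y": 0.91, "model_z": 0.55}

SHORTLIST_THRESHOLD = 80.0  # assumed cut-off for the rough filter

# Step 1: use the public benchmark only to narrow the field.
shortlist = [
    name for name, score in public_scores.items()
    if score >= SHORTLIST_THRESHOLD
]

# Step 2: choose among the shortlisted models using your own evaluation.
best = max(shortlist, key=lambda name: in_house_scores[name])
```

In this example model_y is selected even though it scores three points lower than model_x on the public benchmark, because it performs better on the data that matters to the organisation.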

Why This Matters

When vendors claim their AI model is "best in class," they are usually citing benchmarks. Understanding what benchmarks measure — and what they miss — helps you cut through marketing claims and evaluate models based on what actually matters for your business needs.

Related Terms

Large Language Model (LLM)

A type of AI trained on vast amounts of text to understand and generate human language. The models behind ChatGPT, Claude, and Gemini are all LLMs.

Machine Learning (ML)

A type of AI where systems learn patterns from data instead of following explicitly programmed rules. The system improves its performance through experience.

Artificial Intelligence (AI)

Software that can perform tasks that normally require human intelligence, such as understanding language, recognising patterns, and making decisions.

Inference

The process of an AI model generating output from your input. Every time you send a prompt and get a response, that is inference.

Learn More

Continue learning in Essentials

This topic is covered in our lesson: Choosing the Right AI Model
