Evals (AI Evaluations)
Systematic tests used to measure AI model performance across specific capabilities, safety criteria, or business requirements: the AI equivalent of a software test suite.
Evals, short for evaluations, are systematic tests that measure how well an AI model performs on specific tasks, capabilities, or safety criteria. They are the AI equivalent of a software test suite: a structured way to verify that a system works correctly before and during production deployment.
Why evals matter
Unlike traditional software, where correctness is often binary (the function returns the right value or it does not), AI model quality is nuanced and multidimensional. A model might be excellent at summarisation but mediocre at reasoning. It might handle English brilliantly but struggle with Welsh. Without systematic evaluation, you do not know what you are deploying.
Types of evals
- Capability evals: Measure what the model can do: reasoning, coding, mathematics, language understanding, creative writing.
- Safety evals: Test whether the model refuses harmful requests, avoids bias, and behaves appropriately in sensitive contexts.
- Domain evals: Assess performance on specific industry tasks, such as legal document analysis, medical question answering, or financial analysis.
- Regression evals: Verify that changes (model updates, prompt modifications, new features) do not break existing capabilities.
- User experience evals: Measure qualities like helpfulness, clarity, tone, and formatting from the end user's perspective.
Building effective evals
A good eval system includes:
- Test cases: A curated set of inputs with known expected outputs or evaluation criteria.
- Evaluation method: How to score each response: exact match, rubric-based scoring, human judgement, or LLM-as-judge.
- Metrics: Numerical scores that summarise performance across the test suite.
- Baselines: Previous scores to compare against, enabling tracking of improvement or regression.
- Automation: The ability to run evals automatically when prompts, models, or pipelines change.
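The pieces above can be sketched as a minimal harness. This is an illustrative example, not a specific framework's API: `model` stands in for any callable that maps a prompt string to a response string, and scoring here is simple exact match.

```python
# A minimal eval harness: test cases, a scoring method (exact match),
# a summary metric, and an optional baseline comparison.

def run_eval(model, test_cases, baseline=None):
    """Score each test case by exact match and report overall accuracy."""
    results = []
    for case in test_cases:
        output = model(case["input"])
        results.append({"input": case["input"],
                        "passed": output.strip() == case["expected"]})
    accuracy = sum(r["passed"] for r in results) / len(results)
    report = {"accuracy": accuracy, "results": results}
    if baseline is not None:
        report["delta"] = accuracy - baseline  # negative delta = regression
    return report

# Usage with a toy stand-in "model" and two test cases.
cases = [
    {"input": "2 + 2 = ?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]
toy_model = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
report = run_eval(toy_model, cases, baseline=0.5)
print(report["accuracy"], report["delta"])  # 1.0 0.5
```

In practice the exact-match check would be swapped for whichever evaluation method fits the task, but the shape of the loop stays the same.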
LLM-as-judge
A common modern approach uses a powerful language model to evaluate the outputs of another model. The judge model receives the input, the output, and evaluation criteria, then produces a score and explanation. This scales far better than human evaluation while correlating reasonably well with human judgement.
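A sketch of the pattern, assuming a generic `judge_model` callable that takes a prompt and returns text; the rubric template and the "Score: N" reply format are illustrative conventions, not a standard.

```python
import re

# LLM-as-judge sketch: build a judging prompt from the input, the
# candidate output, and the criteria, then parse a numeric score from
# the judge's reply.

JUDGE_TEMPLATE = """You are grading an AI response.
Task given to the model: {task}
Model's response: {response}
Criteria: {criteria}
Reply in the form "Score: N. <one-sentence explanation>" where N is 1-5."""

def judge(judge_model, task, response, criteria):
    reply = judge_model(JUDGE_TEMPLATE.format(
        task=task, response=response, criteria=criteria))
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"Judge reply had no parsable score: {reply!r}")
    return int(match.group(1)), reply

# Usage with a stub standing in for a real judge-model call.
stub = lambda prompt: "Score: 4. Clear and mostly accurate."
score, explanation = judge(stub, "Summarise the report", "The report says...",
                           "clarity and accuracy")
print(score)  # 4
```

The parse-and-raise step matters: judge models occasionally ignore the requested format, and silently recording a missing score skews the metrics.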
Common mistakes in evals
- Teaching to the test: Optimising for eval performance rather than real-world performance. High eval scores do not always mean the model works well in practice.
- Static evals: Running the same eval suite without updating it. As the model improves, evals need to evolve to remain discriminating.
- Single metrics: Reducing complex performance to a single number. A model with 85% overall accuracy might have 95% accuracy on common cases and 30% on rare but important ones.
- Eval contamination: Using test cases that appeared in the model's training data, producing artificially high scores.
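The single-metric pitfall is cheap to avoid: report accuracy per slice rather than one overall number. A sketch, assuming each result carries a hypothetical `category` tag and a boolean `passed` flag:

```python
from collections import defaultdict

# Aggregate accuracy per category so a weak slice (e.g. rare but
# important cases) stays visible instead of being averaged away.

def accuracy_by_slice(results):
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        passes[r["category"]] += r["passed"]
    return {cat: passes[cat] / totals[cat] for cat in totals}

# 20 common cases (19 pass) and 10 rare cases (3 pass): overall
# accuracy looks fine, but the rare slice is failing badly.
results = (
    [{"category": "common", "passed": True}] * 19 +
    [{"category": "common", "passed": False}] +
    [{"category": "rare", "passed": True}] * 3 +
    [{"category": "rare", "passed": False}] * 7
)
print(accuracy_by_slice(results))  # {'common': 0.95, 'rare': 0.3}
```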
Evals in practice
Organisations building AI-powered applications should:
- Create eval suites tailored to their specific use cases
- Run evals before every model update or prompt change
- Track eval scores over time to detect gradual degradation
- Include adversarial test cases that probe known weaknesses
- Supplement automated evals with periodic human evaluation
Why This Matters
Evals are what separate responsible AI deployment from guesswork. Without systematic evaluation, you cannot know whether your AI system is improving, degrading, or maintaining quality. Building a strong eval practice is one of the highest-leverage investments an AI team can make.