Model Evaluation

Last reviewed: April 2026

The systematic process of measuring an AI model's performance using held-out data and appropriate metrics to determine whether it is good enough for its intended use.

Model evaluation is the process of measuring how well an AI model performs its intended task. It is the difference between "this model seems to work" and "this model meets our performance requirements for deployment."

The fundamental principle

Never evaluate a model on the data it was trained on. Training data evaluation tells you how well the model memorised, not how well it generalises. Always use a separate test set: data the model has never seen.

The train-validation-test split

  • Training set (typically 70-80%): used to train the model
  • Validation set (10-15%): used to tune hyperparameters and make design decisions during development
  • Test set (10-15%): used only once, at the end, to get a final unbiased performance estimate
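
The three-way split above can be sketched in plain Python. This is a minimal illustration, not code from this guide: the function name, seed, and 70/15/15 fractions are all illustrative choices.

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the data, then carve off validation and test sets.

    The 70/15/15 defaults are illustrative; adjust for your dataset size.
    """
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed so the split is reproducible
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before splitting matters: if the data is ordered (by date, by class, by source), slicing without a shuffle gives train and test sets drawn from different distributions. For time-series data, however, a chronological split is usually the right choice instead.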

Common evaluation metrics

For classification: accuracy, precision, recall, F1 score, AUC-ROC

For regression: mean squared error, mean absolute error, R-squared

For language models: perplexity, BLEU score, ROUGE score, human evaluation

For generative models: FID score (image quality), human preference ratings
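
The core classification and regression metrics are simple enough to compute by hand. The sketch below does so from scratch for binary labels; in practice a library such as scikit-learn provides the same metrics, and the example data here is made up purely for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

def regression_metrics(y_true, y_pred):
    """Mean squared error, mean absolute error, and R-squared."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return mse, mae, 1 - ss_res / ss_tot  # R^2: 1 = perfect, 0 = no better than the mean

acc, prec, rec, f1 = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                            [1, 0, 1, 0, 0, 1, 0, 1])
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# accuracy=0.75 precision=0.75 recall=0.75 f1=0.75
```

Which metric matters depends on the cost of each error type: precision penalises false alarms, recall penalises misses, and accuracy alone can be misleading on imbalanced data.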

Cross-validation

When data is limited, k-fold cross-validation provides more reliable estimates. The data is split into k folds, and the model is trained and evaluated k times, each time using a different fold as the test set. Results are averaged.
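
A minimal k-fold sketch in plain Python, assuming a caller-supplied `train_and_score` callback (a hypothetical interface, not from this guide) that fits a model on the training folds and returns its score on the held-out fold:

```python
import random
from statistics import mean

def k_fold_score(data, k, train_and_score, seed=0):
    """Average score over k folds, each fold serving once as the held-out set."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # round-robin split into k folds
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, held_out))
    return mean(scores)

# Toy callback: the "score" is just the fraction of even numbers in the held-out fold.
avg = k_fold_score(range(20), k=5,
                   train_and_score=lambda tr, ho: sum(x % 2 == 0 for x in ho) / len(ho))
print(avg)  # 0.5
```

Common choices are k = 5 or k = 10: larger k gives each training run more data but multiplies the compute cost, since the model is retrained k times.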

Beyond metrics: business evaluation

Technical metrics are necessary but not sufficient. Business evaluation asks:

  • Does the model's performance meet the minimum threshold for this use case?
  • What is the cost of the model's errors in business terms?
  • How does the model perform on edge cases that matter most?
  • Is the model's performance consistent across different user segments?
  • Does the model perform well enough to justify the cost of deploying and maintaining it?

Evaluation pitfalls

  • Data leakage: accidentally including test data information in training, inflating results
  • Overfitting to the test set: repeatedly evaluating on the same test data and tweaking until scores improve
  • Aggregate blindness: good overall metrics that mask poor performance on important subgroups
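
Data leakage is often subtle: a classic case is fitting a preprocessing step, such as feature standardisation, on the full dataset before splitting. The sketch below contrasts the leaky and correct orderings; the data and helper names are illustrative, not from this guide.

```python
def fit_scaler(values):
    """Compute mean and standard deviation (the 'fit' step of standardisation)."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

def scale(values, m, s):
    return [(v - m) / s for v in values]

data = [float(x) for x in range(10)]
train, test = data[:8], data[8:]

# Leaky: statistics fitted on ALL data, so the test set influences training inputs.
m_leaky, s_leaky = fit_scaler(data)

# Correct: fit the scaler on the training set only, then apply it to both splits.
m, s = fit_scaler(train)
train_scaled = scale(train, m, s)
test_scaled = scale(test, m, s)
```

The same rule applies to any fitted preprocessing: imputation, feature selection, and target encoding must all be fitted on the training split alone.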

Why This Matters

Rigorous model evaluation prevents your organisation from deploying AI that looks good in demos but fails in production. It is the quality gate between experimentation and deployment. Understanding evaluation helps you ask the right questions and set meaningful performance thresholds for AI projects.
