Model Evaluation

Last reviewed: April 2026

The systematic process of measuring an AI model's performance using held-out data and appropriate metrics to determine whether it is good enough for its intended use.

Model evaluation is the process of measuring how well an AI model performs its intended task. It is the difference between "this model seems to work" and "this model meets our performance requirements for deployment."

The fundamental principle

Never evaluate a model on the data it was trained on. Training data evaluation tells you how well the model memorised, not how well it generalises. Always use a separate test set: data the model has never seen.

The train-validation-test split

  • Training set (typically 70-80%): used to train the model
  • Validation set (10-15%): used to tune hyperparameters and make design decisions during development
  • Test set (10-15%): used only once, at the end, to get a final unbiased performance estimate
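
The three-way split above can be sketched in plain Python. This is a minimal illustration, not code from this guide: the function name, seed, and 70/15/15 fractions are all illustrative choices.

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the data, then carve off validation and test sets.

    The 70/15/15 defaults are illustrative; adjust for your dataset size.
    """
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed so the split is reproducible
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before splitting matters: if the data is ordered (by date, by class, by source), slicing without a shuffle gives train and test sets drawn from different distributions. For time-series data, however, a chronological split is usually the right choice instead.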

Common evaluation metrics

For classification: accuracy, precision, recall, F1 score, AUC-ROC

For regression: mean squared error, mean absolute error, R-squared

For language models: perplexity, BLEU score, ROUGE score, human evaluation

For generative models: FID score (image quality), human preference ratings
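
The core classification and regression metrics are simple enough to compute by hand. The sketch below does so from scratch for binary labels; in practice a library such as scikit-learn provides the same metrics, and the example data here is made up purely for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

def regression_metrics(y_true, y_pred):
    """Mean squared error, mean absolute error, and R-squared."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return mse, mae, 1 - ss_res / ss_tot  # R^2: 1 = perfect, 0 = no better than the mean

acc, prec, rec, f1 = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                            [1, 0, 1, 0, 0, 1, 0, 1])
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# accuracy=0.75 precision=0.75 recall=0.75 f1=0.75
```

Which metric matters depends on the cost of each error type: precision penalises false alarms, recall penalises misses, and accuracy alone can be misleading on imbalanced data.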

Cross-validation

When data is limited, k-fold cross-validation provides more reliable estimates. The data is split into k folds, and the model is trained and evaluated k times, each time using a different fold as the test set. Results are averaged.
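
A minimal k-fold sketch in plain Python, assuming a caller-supplied `train_and_score` callback (a hypothetical interface, not from this guide) that fits a model on the training folds and returns its score on the held-out fold:

```python
import random
from statistics import mean

def k_fold_score(data, k, train_and_score, seed=0):
    """Average score over k folds, each fold serving once as the held-out set."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # round-robin split into k folds
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, held_out))
    return mean(scores)

# Toy callback: the "score" is just the fraction of even numbers in the held-out fold.
avg = k_fold_score(range(20), k=5,
                   train_and_score=lambda tr, ho: sum(x % 2 == 0 for x in ho) / len(ho))
print(avg)  # 0.5
```

Common choices are k = 5 or k = 10: larger k gives each training run more data but multiplies the compute cost, since the model is retrained k times.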

Beyond metrics: business evaluation

Technical metrics are necessary but not sufficient. Business evaluation asks:

  • Does the model's performance meet the minimum threshold for this use case?
  • What is the cost of the model's errors in business terms?
  • How does the model perform on edge cases that matter most?
  • Is the model's performance consistent across different user segments?
  • Does the model perform well enough to justify the cost of deploying and maintaining it?

Evaluation pitfalls

  • Data leakage: accidentally including test data information in training, inflating results
  • Overfitting to the test set: repeatedly evaluating on the same test data and tweaking until scores improve
  • Aggregate blindness: good overall metrics that mask poor performance on important subgroups
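
Data leakage is often subtle: a classic case is fitting a preprocessing step, such as feature standardisation, on the full dataset before splitting. The sketch below contrasts the leaky and correct orderings; the data and helper names are illustrative, not from this guide.

```python
def fit_scaler(values):
    """Compute mean and standard deviation (the 'fit' step of standardisation)."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

def scale(values, m, s):
    return [(v - m) / s for v in values]

data = [float(x) for x in range(10)]
train, test = data[:8], data[8:]

# Leaky: statistics fitted on ALL data, so the test set influences training inputs.
m_leaky, s_leaky = fit_scaler(data)

# Correct: fit the scaler on the training set only, then apply it to both splits.
m, s = fit_scaler(train)
train_scaled = scale(train, m, s)
test_scaled = scale(test, m, s)
```

The same rule applies to any fitted preprocessing: imputation, feature selection, and target encoding must all be fitted on the training split alone.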

Why This Matters

Rigorous model evaluation prevents your organisation from deploying AI that looks good in demos but fails in production. It is the quality gate between experimentation and deployment. Understanding evaluation helps you ask the right questions and set meaningful performance thresholds for AI projects.
