
Validation Set

Last reviewed: April 2026

A portion of data held back from training and used to evaluate an AI model's performance during development, helping prevent overfitting.

A validation set is a portion of your data that you set aside and do not use for training. Instead, you use it to check how well your model performs on data it has never seen. This simple practice is one of the most important techniques in machine learning: it is how you know whether your model has actually learned useful patterns or just memorised the training data.

The three-way data split

Standard practice divides your data into three sets:

  • Training set (70-80%): Used to train the model. The model sees this data and learns from it.
  • Validation set (10-15%): Used during development to tune the model. You check performance on this set to make decisions about model architecture, hyperparameters, and when to stop training.
  • Test set (10-15%): Used once at the end to get a final, unbiased performance estimate. You never make decisions based on test set performance.
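The split above can be sketched in plain Python. This is a minimal illustration using a 70/15/15 ratio; the function name and exact fractions are choices for this example, not a fixed standard.

```python
import random

def three_way_split(data, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a dataset and split it into train/validation/test portions.

    The 70/15/15 ratio is one common choice; adjust the fractions to taste.
    The test set is whatever remains after the train and validation cuts.
    """
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's order is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder (~15%)
    return train, val, test

train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before splitting matters: if the data is ordered (say, by class or by date), a straight slice would give unrepresentative splits. For time-series data, however, you should split chronologically instead, as noted under common mistakes below.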

Why you need a separate validation set

If you only measured performance on training data, you would have no way to detect overfitting. A model might achieve 99 percent accuracy on training data but only 60 percent on new data. The validation set reveals this gap.

Think of it like studying for an exam. The training data is the textbook you study. The validation set is the practice exam. The test set is the real exam. If you perform well on the practice exam, you are probably prepared. If you only reviewed the textbook without testing yourself, you might discover too late that you memorised answers without understanding concepts.

How the validation set guides development

During model development, you repeatedly check validation performance to:

  • Choose between models: Compare random forest vs gradient boosting vs neural network on validation data
  • Tune hyperparameters: Try different learning rates, tree depths, or regularization strengths
  • Decide when to stop training: Stop when validation performance plateaus or starts declining (early stopping)
  • Select features: Determine which input features improve validation performance
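Early stopping, the last decision in the list above, is easy to show concretely. Below is a framework-agnostic sketch: `train_one_epoch` and `validate` are placeholders for whatever your training library provides, and the `patience` mechanism is the standard formulation (stop after N epochs without improvement).

```python
def early_stopping_loop(train_one_epoch, validate, max_epochs=100, patience=5):
    """Train until validation loss stops improving for `patience` epochs.

    `train_one_epoch` and `validate` are hypothetical callables standing in
    for your framework's training step and validation evaluation.
    Returns the best validation loss seen.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0   # reset the counter on improvement
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # validation performance has plateaued or declined
    return best_loss

# Simulated validation losses: improvement, then a plateau.
losses = iter([1.0, 0.8, 0.9, 0.95, 0.91])
best = early_stopping_loop(lambda: None, lambda: next(losses), patience=3)
print(best)  # 0.8
```

In practice you would also save a checkpoint of the model whenever the best loss improves, so that training ends with the best-validated weights rather than the last ones.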

Cross-validation

When you have limited data, a fixed validation set might not be representative. Cross-validation solves this by rotating which data serves as the validation set:

  1. Split data into k folds (typically 5 or 10)
  2. Train on k-1 folds, validate on the remaining fold
  3. Repeat k times, each time using a different fold for validation
  4. Average the results for a more robust estimate
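The four steps above can be sketched without any ML library: generating the fold indices is pure bookkeeping. The function below is an illustrative implementation of the rotation, with the fold sizes balanced when the data does not divide evenly by k.

```python
def k_fold_indices(n, k=5):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation.

    Each of the k folds serves as the validation set exactly once;
    the other k-1 folds form the training set for that round.
    """
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=5))
print(len(folds))            # 5 rounds
print(folds[0])              # first round: validate on indices 0-1
```

After training and scoring the model once per round, you average the k validation scores to get the more robust estimate described in step 4.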

Common mistakes

  • Data leakage: Information from the validation set accidentally influencing training (e.g., normalising using statistics from the full dataset)
  • Repeated peeking: Making so many decisions based on validation performance that the model becomes overfit to the validation set
  • Non-representative splits: Splitting time-series data randomly instead of chronologically, causing future data to leak into training
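The normalisation leak in the first bullet has a simple fix: compute the statistics on the training set only, then apply those same statistics to every split. A minimal sketch, using standardisation (subtract the mean, divide by the standard deviation) as the example transform:

```python
def fit_standardiser(train):
    """Compute mean and standard deviation from the TRAINING set only."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    std = var ** 0.5 or 1.0   # guard against a zero std
    return mean, std

def standardise(values, mean, std):
    """Apply training-set statistics to any split (train, validation, test)."""
    return [(x - mean) / std for x in values]

train = [1.0, 2.0, 3.0, 4.0]
val = [2.5, 3.5]

mean, std = fit_standardiser(train)        # statistics come from train only
train_scaled = standardise(train, mean, std)
val_scaled = standardise(val, mean, std)   # reuse train stats; never refit on val
```

Fitting the standardiser on the full dataset would let information about the validation examples (their mean and spread) influence the transformed training data, which is exactly the leakage described above.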

Why This Matters

The validation set is your reality check. Without it, you have no reliable way to know whether an AI model will perform well in production. Understanding validation practices helps you ask the right questions when evaluating AI solutions and avoid the common trap of trusting impressive-looking metrics that do not reflect real-world performance.
