
Validation Set

Last reviewed: April 2026

A portion of data held back from training and used to evaluate an AI model's performance during development, helping prevent overfitting.

A validation set is a portion of your data that you set aside and do not use for training. Instead, you use it to check how well your model performs on data it has never seen. This simple practice is one of the most important techniques in machine learning: it is how you know whether your model has actually learned useful patterns or just memorised the training data.

The three-way data split

Standard practice divides your data into three sets:

  • Training set (70-80%): Used to train the model. The model sees this data and learns from it.
  • Validation set (10-15%): Used during development to tune the model. You check performance on this set to make decisions about model architecture, hyperparameters, and when to stop training.
  • Test set (10-15%): Used once at the end to get a final, unbiased performance estimate. You never make decisions based on test set performance.
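The split above can be sketched in plain Python. This is a minimal illustration using a 70/15/15 ratio; the function name and exact fractions are choices for this example, not a fixed standard.

```python
import random

def three_way_split(data, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a dataset and split it into train/validation/test portions.

    The 70/15/15 ratio is one common choice; adjust the fractions to taste.
    The test set is whatever remains after the train and validation cuts.
    """
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's order is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder (~15%)
    return train, val, test

train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before splitting matters: if the data is ordered (say, by class or by date), a straight slice would give unrepresentative splits. For time-series data, however, you should split chronologically instead, as noted under common mistakes below.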

Why you need a separate validation set

If you only measured performance on training data, you would have no way to detect overfitting. A model might achieve 99 percent accuracy on training data but only 60 percent on new data. The validation set reveals this gap.

Think of it like studying for an exam. The training data is the textbook you study. The validation set is the practice exam. The test set is the real exam. If you perform well on the practice exam, you are probably prepared. If you only reviewed the textbook without testing yourself, you might discover too late that you memorised answers without understanding concepts.

How the validation set guides development

During model development, you repeatedly check validation performance to:

  • Choose between models: Compare random forest vs gradient boosting vs neural network on validation data
  • Tune hyperparameters: Try different learning rates, tree depths, or regularization strengths
  • Decide when to stop training: Stop when validation performance plateaus or starts declining (early stopping)
  • Select features: Determine which input features improve validation performance
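Early stopping, the last decision in the list above, is easy to show concretely. Below is a framework-agnostic sketch: `train_one_epoch` and `validate` are placeholders for whatever your training library provides, and the `patience` mechanism is the standard formulation (stop after N epochs without improvement).

```python
def early_stopping_loop(train_one_epoch, validate, max_epochs=100, patience=5):
    """Train until validation loss stops improving for `patience` epochs.

    `train_one_epoch` and `validate` are hypothetical callables standing in
    for your framework's training step and validation evaluation.
    Returns the best validation loss seen.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0   # reset the counter on improvement
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # validation performance has plateaued or declined
    return best_loss

# Simulated validation losses: improvement, then a plateau.
losses = iter([1.0, 0.8, 0.9, 0.95, 0.91])
best = early_stopping_loop(lambda: None, lambda: next(losses), patience=3)
print(best)  # 0.8
```

In practice you would also save a checkpoint of the model whenever the best loss improves, so that training ends with the best-validated weights rather than the last ones.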

Cross-validation

When you have limited data, a fixed validation set might not be representative. Cross-validation solves this by rotating which data serves as the validation set:

  1. Split data into k folds (typically 5 or 10)
  2. Train on k-1 folds, validate on the remaining fold
  3. Repeat k times, each time using a different fold for validation
  4. Average the results for a more robust estimate
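The four steps above can be sketched without any ML library: generating the fold indices is pure bookkeeping. The function below is an illustrative implementation of the rotation, with the fold sizes balanced when the data does not divide evenly by k.

```python
def k_fold_indices(n, k=5):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation.

    Each of the k folds serves as the validation set exactly once;
    the other k-1 folds form the training set for that round.
    """
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=5))
print(len(folds))            # 5 rounds
print(folds[0])              # first round: validate on indices 0-1
```

After training and scoring the model once per round, you average the k validation scores to get the more robust estimate described in step 4.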

Common mistakes

  • Data leakage: Information from the validation set accidentally influencing training (e.g., normalising using statistics from the full dataset)
  • Repeated peeking: Making so many decisions based on validation performance that the model becomes overfit to the validation set
  • Non-representative splits: Splitting time-series data randomly instead of chronologically, causing future data to leak into training
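The normalisation leak in the first bullet has a simple fix: compute the statistics on the training set only, then apply those same statistics to every split. A minimal sketch, using standardisation (subtract the mean, divide by the standard deviation) as the example transform:

```python
def fit_standardiser(train):
    """Compute mean and standard deviation from the TRAINING set only."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    std = var ** 0.5 or 1.0   # guard against a zero std
    return mean, std

def standardise(values, mean, std):
    """Apply training-set statistics to any split (train, validation, test)."""
    return [(x - mean) / std for x in values]

train = [1.0, 2.0, 3.0, 4.0]
val = [2.5, 3.5]

mean, std = fit_standardiser(train)        # statistics come from train only
train_scaled = standardise(train, mean, std)
val_scaled = standardise(val, mean, std)   # reuse train stats; never refit on val
```

Fitting the standardiser on the full dataset would let information about the validation examples (their mean and spread) influence the transformed training data, which is exactly the leakage described above.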

Why This Matters

The validation set is your reality check. Without it, you have no reliable way to know whether an AI model will perform well in production. Understanding validation practices helps you ask the right questions when evaluating AI solutions and avoid the common trap of trusting impressive-looking metrics that do not reflect real-world performance.
