Data Leakage
A critical error in machine learning where information that would not be available at prediction time improperly influences model training, producing unrealistically good development results that fail in production.
Data leakage occurs when information that should not be available to an AI model during training inadvertently influences its learning, leading to artificially inflated performance metrics that do not reflect real-world capability. It is one of the most common and damaging mistakes in machine learning projects.
How data leakage happens
Leakage takes many forms, but the core problem is always the same: the model learns from information it would not have access to in production.
Common scenarios include:
- Target leakage: A feature in the training data directly encodes the outcome you are trying to predict. For example, including "treatment prescribed" as a feature when predicting whether a patient has a disease; the treatment would not be known before the diagnosis.
- Train-test contamination: Test data accidentally appears in the training set, or the training process uses information derived from the test set (such as normalisation statistics computed on the full dataset).
- Temporal leakage: Using future data to predict past events. For example, using next month's sales figures as a feature when predicting this month's customer churn.
- Group leakage: When related data points (such as multiple records from the same customer) appear in both training and test sets, the model learns customer-specific patterns that inflate test performance.
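The train-test contamination scenario above is easiest to see with normalisation statistics. The sketch below uses hypothetical toy data: computing the mean and standard deviation on the full dataset lets a test-set outlier shift the scale the model trains on, while computing them on the training split alone keeps the test set unseen.

```python
# Sketch: how computing normalisation statistics on the FULL dataset
# leaks test-set information into training. Toy data is hypothetical.

def mean_std(values):
    """Mean and population standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # the last point is a test-set outlier
train, test = data[:4], data[4:]

# LEAKY: statistics computed on train and test together
leaky_mean, leaky_std = mean_std(data)

# CORRECT: statistics computed on the training split only, then reused
# unchanged to transform the test split
clean_mean, clean_std = mean_std(train)

print(leaky_mean, clean_mean)  # 22.0 vs 2.5: the outlier dragged the training scale
```

The same rule applies to any fitted preprocessing step (imputation, encoding, feature selection): fit it on the training split only, then apply it to the test split.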
Why leakage is so dangerous
The insidious thing about data leakage is that it makes your model look excellent during development. Accuracy metrics soar. Stakeholders get excited. The model ships to production, and immediately fails. The gap between development performance and production performance is often the first sign of leakage.
A classic example: a hospital built a model to predict pneumonia outcomes. The model discovered that patients transferred from certain wards had lower mortality, not because the ward was better, but because those patients were less severely ill. In production, this pattern was useless.
How to detect leakage
- Suspiciously high accuracy: If a model performs much better than expected, leakage should be your first suspect.
- Feature importance analysis: If the most important feature is one that would not logically be available at prediction time, investigate.
- Performance drop in production: A dramatic gap between development and production metrics often indicates leakage, though distribution shift between training and live data can produce a similar gap.
- Temporal validation: Always test on data from a later time period than training data.
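The temporal-validation check above can be sketched in a few lines. The record structure and field names ("date", "churned") are hypothetical; the point is that everything before a cutoff trains and everything at or after it tests, so no future information reaches the model.

```python
# Sketch of a time-based split: train on earlier records, evaluate on later
# ones. Record fields are hypothetical illustration only.

records = [
    {"date": "2024-01", "churned": 0},
    {"date": "2024-02", "churned": 1},
    {"date": "2024-03", "churned": 0},
    {"date": "2024-04", "churned": 1},
]

def temporal_split(rows, cutoff):
    """Rows strictly before the cutoff train; the rest test."""
    rows = sorted(rows, key=lambda r: r["date"])
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test

train, test = temporal_split(records, "2024-03")

# Sanity check: the newest training record predates the oldest test record
assert max(r["date"] for r in train) < min(r["date"] for r in test)
```

A random split over the same records would scatter January and April rows across both sides, silently letting the model train on the future.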
How to prevent leakage
- Split your data before any preprocessing or feature engineering.
- Be explicit about what information would be available at prediction time.
- Use time-based splits for temporal data; never random splits.
- Review feature definitions with domain experts.
- Monitor production performance against development metrics.
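Group leakage from the earlier list is prevented the same way: split by group, not by row, so every record from a given customer lands on exactly one side. A minimal sketch, with a hypothetical "customer_id" field:

```python
# Sketch: group-aware splitting keeps all records from the same customer on
# one side, so the model cannot memorise customer-specific patterns.
# Field names are hypothetical.

def group_split(rows, test_groups):
    """Assign whole groups (not individual rows) to the test set."""
    train = [r for r in rows if r["customer_id"] not in test_groups]
    test = [r for r in rows if r["customer_id"] in test_groups]
    return train, test

rows = [
    {"customer_id": "a", "x": 1},
    {"customer_id": "a", "x": 2},
    {"customer_id": "b", "x": 3},
    {"customer_id": "c", "x": 4},
]

train, test = group_split(rows, {"b"})

# No customer appears on both sides of the split
train_ids = {r["customer_id"] for r in train}
test_ids = {r["customer_id"] for r in test}
assert train_ids.isdisjoint(test_ids)
```

Libraries such as scikit-learn provide the same idea as ready-made splitters (for example, group-based cross-validation), which are preferable to hand-rolled splits in real projects.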
Why This Matters
Data leakage is one of the most common reasons AI projects succeed in development but fail in production. Understanding this concept helps you ask the right questions when evaluating AI solutions and avoid the costly mistake of deploying models that only appear to work.