Data Leakage
A critical error in machine learning where information that would not be available at prediction time improperly influences model training, producing unrealistically good development results that fail in production.
Data leakage occurs when information that should not be available to an AI model during training inadvertently influences its learning, leading to artificially inflated performance metrics that do not reflect real-world capability. It is one of the most common and damaging mistakes in machine learning projects.
How data leakage happens
Leakage takes many forms, but the core problem is always the same: the model learns from information it would not have access to in production.
Common scenarios include:
- Target leakage: A feature in the training data directly encodes the outcome you are trying to predict. For example, including "treatment prescribed" as a feature when predicting whether a patient has a disease; the treatment would not be known before the diagnosis.
- Train-test contamination: Test data accidentally appears in the training set, or the training process uses information derived from the test set (such as normalisation statistics computed on the full dataset).
- Temporal leakage: Using future data to predict past events. For example, using next month's sales figures as a feature when predicting this month's customer churn.
- Group leakage: When related data points (such as multiple records from the same customer) appear in both training and test sets, the model learns customer-specific patterns that inflate test performance.
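The train-test contamination scenario above is easiest to see with normalisation statistics. The sketch below uses hypothetical toy data: computing the mean and standard deviation on the full dataset lets a test-set outlier shift the scale the model trains on, while computing them on the training split alone keeps the test set unseen.

```python
# Sketch: how computing normalisation statistics on the FULL dataset
# leaks test-set information into training. Toy data is hypothetical.

def mean_std(values):
    """Mean and population standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # the last point is a test-set outlier
train, test = data[:4], data[4:]

# LEAKY: statistics computed on train and test together
leaky_mean, leaky_std = mean_std(data)

# CORRECT: statistics computed on the training split only, then reused
# unchanged to transform the test split
clean_mean, clean_std = mean_std(train)

print(leaky_mean, clean_mean)  # 22.0 vs 2.5: the outlier dragged the training scale
```

The same rule applies to any fitted preprocessing step (imputation, encoding, feature selection): fit it on the training split only, then apply it to the test split.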
Why leakage is so dangerous
The insidious thing about data leakage is that it makes your model look excellent during development. Accuracy metrics soar. Stakeholders get excited. The model ships to production, and immediately fails. The gap between development performance and production performance is often the first sign of leakage.
A classic example: a hospital built a model to predict pneumonia outcomes. The model discovered that patients transferred from certain wards had lower mortality, not because the ward was better, but because those patients were less severely ill. In production, this pattern was useless.
How to detect leakage
- Suspiciously high accuracy: If a model performs much better than expected, leakage should be your first suspect.
- Feature importance analysis: If the most important feature is one that would not logically be available at prediction time, investigate.
- Performance drop in production: A dramatic gap between development and production metrics often indicates leakage, though distribution shift between training and live data can produce a similar gap.
- Temporal validation: Always test on data from a later time period than training data.
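The temporal-validation check above can be sketched in a few lines. The record structure and field names ("date", "churned") are hypothetical; the point is that everything before a cutoff trains and everything at or after it tests, so no future information reaches the model.

```python
# Sketch of a time-based split: train on earlier records, evaluate on later
# ones. Record fields are hypothetical illustration only.

records = [
    {"date": "2024-01", "churned": 0},
    {"date": "2024-02", "churned": 1},
    {"date": "2024-03", "churned": 0},
    {"date": "2024-04", "churned": 1},
]

def temporal_split(rows, cutoff):
    """Rows strictly before the cutoff train; the rest test."""
    rows = sorted(rows, key=lambda r: r["date"])
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test

train, test = temporal_split(records, "2024-03")

# Sanity check: the newest training record predates the oldest test record
assert max(r["date"] for r in train) < min(r["date"] for r in test)
```

A random split over the same records would scatter January and April rows across both sides, silently letting the model train on the future.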
How to prevent leakage
- Split your data before any preprocessing or feature engineering.
- Be explicit about what information would be available at prediction time.
- Use time-based splits for temporal data; never random splits.
- Review feature definitions with domain experts.
- Monitor production performance against development metrics.
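Group leakage from the earlier list is prevented the same way: split by group, not by row, so every record from a given customer lands on exactly one side. A minimal sketch, with a hypothetical "customer_id" field:

```python
# Sketch: group-aware splitting keeps all records from the same customer on
# one side, so the model cannot memorise customer-specific patterns.
# Field names are hypothetical.

def group_split(rows, test_groups):
    """Assign whole groups (not individual rows) to the test set."""
    train = [r for r in rows if r["customer_id"] not in test_groups]
    test = [r for r in rows if r["customer_id"] in test_groups]
    return train, test

rows = [
    {"customer_id": "a", "x": 1},
    {"customer_id": "a", "x": 2},
    {"customer_id": "b", "x": 3},
    {"customer_id": "c", "x": 4},
]

train, test = group_split(rows, {"b"})

# No customer appears on both sides of the split
train_ids = {r["customer_id"] for r in train}
test_ids = {r["customer_id"] for r in test}
assert train_ids.isdisjoint(test_ids)
```

Libraries such as scikit-learn provide the same idea as ready-made splitters (for example, group-based cross-validation), which are preferable to hand-rolled splits in real projects.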
Why This Matters
Data leakage is one of the most common reasons AI projects succeed in development but fail in production. Understanding this concept helps you ask the right questions when evaluating AI solutions and avoid the costly mistake of deploying models that only appear to work.