Noise in Data
Random errors, irrelevant information, or inconsistencies in a dataset that can mislead AI models and reduce their performance.
Noise in data refers to random errors, irrelevant information, or inconsistencies that obscure the true patterns a model is trying to learn. Every real-world dataset contains some noise; the question is how much and how to handle it.
Sources of noise
- Measurement errors – sensors producing inaccurate readings, typos in manual data entry
- Labelling errors – annotators assigning incorrect labels to training data
- Irrelevant features – variables that have no relationship to the prediction target but are included in the dataset
- Outliers – data points that are far from the norm, whether due to errors or genuine rare events
- Missing data – gaps that are filled with estimates or defaults, introducing imprecision
- Temporal noise – data collected during unusual periods (holidays, outages) that does not represent normal patterns
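Labelling errors in particular are easy to simulate. The sketch below (a hypothetical helper, not from any specific library) corrupts a fraction of binary labels at random, the way inattentive annotators might:

```python
import random

def flip_labels(labels, noise_rate, seed=0):
    """Simulate labelling errors: flip a fraction of binary labels at random."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(len(labels) * noise_rate)
    # Pick distinct positions to corrupt (the "annotator mistakes")
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy

clean = [0, 1] * 50              # 100 binary labels
noisy = flip_labels(clean, 0.1)  # 10% labelling noise
errors = sum(c != n for c, n in zip(clean, noisy))
print(errors)  # → 10 labels differ from the clean set
```

Injecting noise deliberately like this is a common way to test how robust a training pipeline is to dirty labels.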
How noise affects models
- Underfitting – if noise overwhelms the signal, the model cannot learn meaningful patterns at all
- Overfitting – models with enough capacity learn the noise along with the signal, memorising random fluctuations that do not generalise
- Reduced accuracy – even models that generalise reasonably well perform worse on noisy data than on clean data
- Bias – systematic (non-random) noise can skew model predictions in a consistent direction
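The overfitting point can be shown with a deliberately tiny experiment (all names here are illustrative): a 1-nearest-neighbour model has enough capacity to memorise a corrupted training label perfectly, while a simpler threshold rule ignores it and generalises better:

```python
def nn_predict(train, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def threshold_predict(x):
    """A deliberately simple model: predict 1 when x >= 5."""
    return 1 if x >= 5 else 0

# Ground truth: label is 1 iff x >= 5. One training label is corrupted.
train = [(x, 1 if x >= 5 else 0) for x in range(10)]
train[3] = (3, 1)  # labelling error: noise in the training set

# Clean test points at x = 0.5, 1.5, ..., 9.5
test = [(x + 0.5, 1 if x + 0.5 >= 5 else 0) for x in range(10)]

nn_train_acc = sum(nn_predict(train, x) == y for x, y in train) / len(train)
nn_test_acc = sum(nn_predict(train, x) == y for x, y in test) / len(test)
th_test_acc = sum(threshold_predict(x) == y for x, y in test) / len(test)

print(nn_train_acc, nn_test_acc, th_test_acc)  # → 1.0 0.9 1.0
```

The memoriser scores perfectly on its own (noisy) training data but loses accuracy on clean test data near the corrupted point, while the simple rule is unaffected.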
Dealing with noise
- Data cleaning – identifying and correcting or removing erroneous data points before training
- Robust loss functions – using loss functions that are less sensitive to outliers
- Regularisation – techniques like dropout and weight decay that prevent the model from memorising noise
- Ensemble methods – combining multiple models to average out noise effects
- Data augmentation – generating additional training examples so that random noise averages out rather than being memorised
- Feature selection – removing irrelevant features that add noise without information
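As a concrete example of a robust loss function, the Huber loss is quadratic for small errors but linear for large ones, so a single outlier contributes far less to the total loss than it would under squared error. A minimal implementation:

```python
def squared_loss(error):
    return error ** 2

def huber_loss(error, delta=1.0):
    """Quadratic near zero, linear for large errors, so outliers count less."""
    a = abs(error)
    if a <= delta:
        return 0.5 * a ** 2
    return delta * (a - 0.5 * delta)

# An outlier with error 10 dominates the squared loss ...
print(squared_loss(10))   # → 100
# ... but grows only linearly under the Huber loss
print(huber_loss(10))     # → 9.5
# Small errors are treated almost identically by both
print(squared_loss(0.5), huber_loss(0.5))  # → 0.25 0.125
```

Because the gradient of the Huber loss is capped at `delta` for large errors, a few wildly wrong data points cannot drag the fitted model far from the bulk of the data.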
The signal-to-noise ratio
The key concept is the signal-to-noise ratio. Models learn from signal (real patterns) and are misled by noise (random variation). Everything in data preparation and model training aims to maximise this ratio.
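One simple way to make the ratio concrete (a sketch, assuming additive noise and using variance as the measure of each component) is to compare the variance of the underlying pattern with the variance of the random component:

```python
import random

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(42)
signal = [2.0 * (i % 10) for i in range(100)]     # a clean repeating pattern
noise = [rng.gauss(0, 1.0) for _ in range(100)]   # random variation
observed = [s + n for s, n in zip(signal, noise)]  # what the model sees

snr = variance(signal) / variance(noise)
print(round(snr, 1))  # well above 1: the pattern dominates the noise
```

Cleaning data reduces the denominator; collecting more informative features increases the numerator. Either way, a higher ratio makes the real pattern easier to learn.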
Noise is not always bad
Controlled noise injection (like adding noise to training images) can actually improve model robustness by preventing overfitting to exact training examples.
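A minimal sketch of this idea (the helper name and parameters are illustrative): generate slightly perturbed copies of each training example, so the model sees the same underlying pattern under many small random variations instead of memorising exact inputs:

```python
import random

def augment_with_noise(features, sigma=0.05, copies=3, seed=0):
    """Controlled noise injection: make perturbed copies of each example."""
    rng = random.Random(seed)
    augmented = []
    for row in features:
        augmented.append(list(row))  # keep the original example
        for _ in range(copies):
            # Each copy gets small Gaussian jitter on every feature
            augmented.append([x + rng.gauss(0, sigma) for x in row])
    return augmented

data = [[0.2, 0.7], [0.9, 0.1]]
aug = augment_with_noise(data)
print(len(aug))  # → 8: 2 originals + 2 * 3 noisy copies
```

The same principle underlies common image augmentations (random crops, pixel noise): the injected noise is small enough to preserve the label but large enough to stop the model latching onto exact pixel values.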
Why This Matters
Data quality is the number one determinant of AI project success, and noise is the most common data quality problem. Understanding noise helps you prioritise data cleaning over model complexity β a simple model on clean data almost always outperforms a complex model on noisy data.