Core AI

Data Augmentation

Last reviewed: April 2026

Techniques for artificially expanding a training dataset by creating modified versions of existing data, improving model performance without collecting new data.

Data augmentation increases the size and diversity of a training dataset by transforming examples you already have, for instance rotating an image or paraphrasing a sentence. Because it reuses data that has already been collected and labelled, it is often one of the most cost-effective ways to improve model performance.

Why augmentation matters

Models generally learn more robust patterns from larger, more varied datasets, but collecting and annotating new data is expensive and slow. Augmentation lets you multiply your existing data by creating plausible variations, giving the model more examples to learn from without the cost of new data collection.

Image augmentation techniques

Geometric transformations — rotating, flipping, cropping, or scaling images
Colour adjustments — changing brightness, contrast, saturation, or adding colour jitter
Noise injection — adding random noise to make the model robust to imperfect inputs
Cutout and mixup — masking portions of images or blending two images together
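
The image techniques above can be sketched with NumPy alone. This is a minimal illustration, not a production pipeline: it assumes images are float arrays of shape height x width x channels with values in [0, 1], and the function names and parameter defaults are chosen here for illustration. Note that mixup, as usually described, blends the labels with the same weight as the images.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an H x W x C image left to right."""
    return image[:, ::-1, :]

def adjust_brightness(image, factor):
    """Scale pixel values by `factor`, keeping them in [0, 1]."""
    return np.clip(image * factor, 0.0, 1.0)

def add_gaussian_noise(image, rng, std=0.05):
    """Add zero-mean Gaussian noise, keeping values in [0, 1]."""
    noise = rng.normal(0.0, std, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

def cutout(image, rng, size=8):
    """Zero out a randomly placed size x size square."""
    height, width = image.shape[:2]
    top = rng.integers(0, height - size + 1)
    left = rng.integers(0, width - size + 1)
    out = image.copy()
    out[top:top + size, left:left + size, :] = 0.0
    return out

def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two images and their one-hot labels with weight `lam`."""
    image = lam * image_a + (1.0 - lam) * image_b
    label = lam * label_a + (1.0 - lam) * label_b
    return image, label

# Toy data: two random "images" and one-hot labels for a two-class problem.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
other = rng.random((32, 32, 3))

flipped = horizontal_flip(image)
brighter = adjust_brightness(image, 1.2)
noisy = add_gaussian_noise(image, rng)
masked = cutout(image, rng, size=8)
mixed, mixed_label = mixup(image, np.array([1.0, 0.0]),
                           other, np.array([0.0, 1.0]), lam=0.7)
```

In practice these transforms are usually applied on the fly with randomly drawn parameters, so the model sees a slightly different variant of each image in every epoch.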

Text augmentation techniques

Synonym replacement — swapping words with synonyms while preserving meaning
Back-translation — translating text to another language and back, producing natural paraphrases
Random insertion, deletion, or swap — minor perturbations that teach robustness
LLM-based augmentation — using a language model to generate paraphrases or entirely new examples in the same style
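
The simpler text perturbations can be sketched in plain Python. This is an illustrative sketch under stated assumptions: text is already split into words, and the synonyms dictionary is a hypothetical hand-written lookup standing in for a real thesaurus. Back-translation and LLM-based augmentation need external models and are not shown.

```python
import random

def synonym_replacement(words, synonyms, rng):
    """Replace every word that has an entry in `synonyms` with a random synonym."""
    return [rng.choice(synonyms[w]) if w in synonyms else w for w in words]

def random_deletion(words, p, rng):
    """Drop each word with probability `p`, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

def random_swap(words, n_swaps, rng):
    """Swap the positions of two randomly chosen words, `n_swaps` times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

# Hypothetical toy synonym table; a real system would use a larger resource.
synonyms = {"quick": ["fast", "speedy"], "lazy": ["idle"]}

rng = random.Random(0)
words = "the quick brown fox jumps over the lazy dog".split()

replaced = synonym_replacement(words, synonyms, rng)
deleted = random_deletion(words, p=0.2, rng=rng)
swapped = random_swap(words, n_swaps=2, rng=rng)
```

Word-level perturbations like these can change meaning (deleting "not", for example), so it is worth spot-checking a sample of the output against its labels.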

Audio augmentation

Adding background noise, changing speed or pitch, time-shifting — making speech recognition models robust to real-world conditions
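
A rough NumPy sketch of these audio transforms follows. It assumes a mono signal stored as a 1-D float array; the function names, the 16 kHz sample rate, and the synthetic test tone are choices made for this example. The speed change here is plain resampling by linear interpolation, which also shifts pitch; changing speed and pitch independently needs more involved signal processing.

```python
import numpy as np

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at roughly the requested signal-to-noise ratio."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def time_shift(signal, shift):
    """Shift by `shift` samples (positive = later), padding with silence."""
    out = np.zeros_like(signal)
    if shift > 0:
        out[shift:] = signal[:-shift]
    elif shift < 0:
        out[:shift] = signal[-shift:]
    else:
        out[:] = signal
    return out

def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 plays faster (and higher)."""
    n_out = int(round(len(signal) / rate))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)

# Synthetic stand-in for a recording: one second of a 440 Hz tone at 16 kHz.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)

noisy = add_noise(tone, snr_db=20, rng=rng)
shifted = time_shift(tone, 1600)      # delay by 0.1 seconds
faster = change_speed(tone, 1.25)     # 25% faster, so a shorter clip
```

For speech recognition, recorded background noise (street, office, café) is usually a more realistic choice than the white noise used in this sketch.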

Best practices

Augmented data should be plausible — extreme distortions can hurt rather than help
Augmentation should preserve labels: a horizontally flipped cat is still a cat, but a "6" rotated 180 degrees looks like a "9"
Use augmentation to address class imbalance by generating more examples of underrepresented categories
Validate that augmentation actually improves performance on a held-out test set of real, unaugmented data
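
As one way to picture the class-imbalance point, the sketch below tops up underrepresented classes with augmented copies until every class matches the largest one. It is a minimal sketch under assumptions: balance_with_augmentation and jitter are hypothetical helpers written for this example, and the data is a toy list of numbers. It should be applied to the training split only, so that the held-out set stays real and unaugmented.

```python
import random
from collections import Counter

def balance_with_augmentation(examples, labels, augment, rng):
    """Add augmented copies of minority-class examples until classes are equal in size."""
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(items) for items in by_class.values())

    out_x, out_y = list(examples), list(labels)
    for y, items in by_class.items():
        # Generate only as many new examples as this class is missing.
        for _ in range(target - len(items)):
            out_x.append(augment(rng.choice(items), rng))
            out_y.append(y)
    return out_x, out_y

def jitter(value, rng):
    """Hypothetical augmentation for numeric features: add small Gaussian noise."""
    return value + rng.gauss(0.0, 0.01)

# Toy training split: six "cat" examples and two "dog" examples.
rng = random.Random(0)
train_x = [0.10, 0.12, 0.09, 0.11, 0.13, 0.08, 0.90, 0.95]
train_y = ["cat"] * 6 + ["dog"] * 2

balanced_x, balanced_y = balance_with_augmentation(train_x, train_y, jitter, rng)
class_counts = Counter(balanced_y)
```

Whether this helps is an empirical question: compare a model trained on the balanced set against one trained on the original set, using the same untouched test data for both.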

Why This Matters

Data augmentation can save your organisation significant time and money in AI projects. Instead of spending months collecting more training data, well-chosen augmentation can often deliver comparable improvements at a fraction of the cost. It is particularly valuable for specialised domains where labelled data is scarce.

Related Terms

Training Data

The dataset used to teach an AI model. The quality, size, and composition of training data directly determine what the AI can and cannot do well.

Machine Learning (ML)

A type of AI where systems learn patterns from data instead of following explicitly programmed rules. The system improves its performance through experience.

Supervised Learning

A machine learning approach where the model learns from labelled examples — input data paired with correct answers. The most common type of machine learning in business applications.

Deep Learning

A subset of machine learning that uses neural networks with many layers to learn complex patterns. The 'deep' refers to the number of layers, not the depth of understanding.
