Data Augmentation
Techniques for artificially expanding a training dataset by creating modified versions of existing data, improving model performance without collecting new data.
Data augmentation increases the size and diversity of a training dataset by applying transformations to existing examples that change their appearance but not their meaning. It is often one of the most cost-effective ways to improve model performance.
Why augmentation matters
Machine learning models generally learn better from more, and more varied, data. But collecting and annotating new data is expensive and slow. Augmentation lets you multiply your existing data by creating plausible variations, giving the model more examples to learn from without the cost of new data collection.
Image augmentation techniques
- Geometric transformations: rotating, flipping, cropping, or scaling images
- Colour adjustments: changing brightness, contrast, saturation, or adding colour jitter
- Noise injection: adding random noise to make the model robust to imperfect inputs
- Cutout and mixup: masking portions of images or blending two images together
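The techniques listed above can be sketched in a few lines. The example below is a dependency-free illustration, assuming a grayscale image stored as a nested list of floats in the range 0.0 to 1.0; the function names are hypothetical, and a real project would more likely use an image library such as torchvision or Albumentations.

```python
import random


def hflip(img):
    """Geometric transformation: mirror each row left to right."""
    return [row[::-1] for row in img]


def adjust_brightness(img, factor):
    """Colour adjustment: scale every pixel, clamped to [0.0, 1.0]."""
    return [[min(1.0, max(0.0, p * factor)) for p in row] for row in img]


def add_gaussian_noise(img, sigma, rng):
    """Noise injection: add zero-mean Gaussian noise to every pixel."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


def cutout(img, size, rng):
    """Cutout: zero a randomly placed size x size square."""
    height, width = len(img), len(img[0])
    top = rng.randrange(0, height - size + 1)
    left = rng.randrange(0, width - size + 1)
    out = [row[:] for row in img]  # copy so the input is left untouched
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = 0.0
    return out


def mixup(img_a, img_b, lam):
    """Mixup: blend two same-sized images with weight lam.

    The labels are blended with the same weight, which is not shown here.
    """
    return [[lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]
```

Passing a seeded `random.Random` instance as `rng` keeps the random transformations reproducible, which makes it easier to compare training runs.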
Text augmentation techniques
- Synonym replacement: swapping words with synonyms while preserving meaning
- Back-translation: translating text to another language and back, producing natural paraphrases
- Random insertion, deletion, or swap: minor perturbations that teach robustness
- LLM-based augmentation: using a language model to generate paraphrases or entirely new examples in the same style
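The simpler token-level techniques above can be illustrated as follows. This is a minimal sketch, assuming text has already been split into tokens; the `SYNONYMS` table is a hypothetical toy stand-in for a real thesaurus, and back-translation and LLM-based augmentation are left out because they depend on external models.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
}


def synonym_replace(tokens, rng, p=0.3):
    """Replace each token that has a known synonym with probability p."""
    out = []
    for token in tokens:
        options = SYNONYMS.get(token.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(token)
    return out


def random_delete(tokens, rng, p=0.1):
    """Drop each token with probability p, always keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


def random_swap(tokens, rng):
    """Swap the tokens at two distinct random positions."""
    if len(tokens) < 2:
        return tokens[:]
    i, j = rng.sample(range(len(tokens)), 2)
    out = tokens[:]
    out[i], out[j] = out[j], out[i]
    return out
```

Keeping `p` small matters here: deleting or swapping too many tokens can change the meaning of a sentence, and with it the correct label.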
Audio augmentation
- Adding background noise, changing speed or pitch, and time-shifting, all of which help make speech recognition models robust to real-world conditions
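As a rough illustration of these audio transformations, the sketch below assumes a mono waveform stored as a list of float samples. The function names and the `snr_db` parameter are assumptions made for this example. Note that the naive resampling used in `change_speed` alters speed and pitch together; changing one without the other needs a dedicated audio library.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise at roughly the given signal-to-noise ratio (dB)."""
    if not samples:
        return []
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def time_shift(samples, shift):
    """Shift right by `shift` samples (left if negative), padding with silence."""
    n = len(samples)
    if shift >= 0:
        return ([0.0] * shift + samples)[:n]
    padded = samples + [0.0] * (-shift)
    return padded[len(padded) - n:]


def change_speed(samples, rate):
    """Resample by linear interpolation; rate > 1 shortens the clip."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = max(1, int(len(samples) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```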
Best practices
- Augmented data should be plausible: extreme distortions can hurt rather than help
- Augmentation should preserve labels: a horizontally flipped cat is still a cat, but a "6" rotated by 180 degrees reads as a "9"
- Use augmentation to address class imbalance by generating more examples of underrepresented categories
- Validate that augmentation actually improves performance on a held-out test set that has not itself been augmented
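The class-imbalance practice above can be sketched as follows. This is one possible approach, assuming the training data is a list of `(features, label)` pairs and that `augment` is any label-preserving function; `balance_with_augmentation` and `jitter` are hypothetical names introduced for illustration.

```python
import random
from collections import Counter, defaultdict


def balance_with_augmentation(train, augment, rng):
    """Top up every class to the size of the largest one using augmented copies."""
    if not train:
        return []
    by_label = defaultdict(list)
    for features, label in train:
        by_label[label].append(features)
    target = max(len(items) for items in by_label.values())
    balanced = list(train)  # originals are kept as they are
    for label, items in by_label.items():
        for _ in range(target - len(items)):
            balanced.append((augment(rng.choice(items), rng), label))
    return balanced


def jitter(features, rng):
    """Hypothetical augmentation: add small Gaussian noise to numeric features."""
    return [v + rng.gauss(0.0, 0.01) for v in features]


rng = random.Random(1)
train = [([1.0, 2.0], "common")] * 5 + [([9.0, 8.0], "rare")]
balanced = balance_with_augmentation(train, jitter, rng)
counts = Counter(label for _, label in balanced)
```

Split off the held-out test set before calling a function like this, so that augmented copies of test examples never end up in the training data.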
Why This Matters
Data augmentation can save your organisation significant time and money in AI projects. Instead of spending months collecting more training data, strategic augmentation can often deliver a meaningful share of the same improvement at a fraction of the cost. It is particularly valuable for specialised domains where labelled data is scarce.