Data Labelling

Last reviewed: April 2026

The process of tagging raw data with informative labels so supervised machine learning models can learn the relationship between inputs and desired outputs.

Data labelling is the process of assigning meaningful tags or annotations to raw data — text, images, audio, video — so that machine learning models can learn from it. It is the human work that makes supervised learning possible.

Why labelling matters

Supervised machine learning requires examples of inputs paired with correct outputs. A spam filter needs thousands of emails labelled "spam" or "not spam." An image classifier needs pictures labelled with their contents. Without labels, the model has no ground truth to learn from.
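
To make "inputs paired with correct outputs" concrete, here is a minimal sketch in plain Python. The four example emails, the `train` and `predict` helpers, and the word-voting rule are all assumptions made for illustration. Real spam filters use far more data and a proper learning algorithm, but the dependence on labels is the same.

```python
from collections import Counter, defaultdict

# Each training example is an (input, label) pair supplied by a human labeller.
# These four emails are invented for illustration.
labelled_emails = [
    ("win a free prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("please review the agenda", "not spam"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    for text, label in examples:
        for word in text.split():
            word_counts[word][label] += 1
    return word_counts

def predict(word_counts, text, default="not spam"):
    """Let every known word vote for the labels it was seen with."""
    votes = Counter()
    for word in text.split():
        votes.update(word_counts.get(word, {}))
    # With no known words there is nothing to go on, so fall back to a default.
    return votes.most_common(1)[0][0] if votes else default

model = train(labelled_emails)
print(predict(model, "free prize inside"))
print(predict(model, "monday agenda"))
```

If the label column is removed, `train` has nothing to count, which is the sense in which the model has no ground truth without labels.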

Types of labelling tasks

Classification labels — assigning a category to an entire item (this email is spam, this review is negative)
Bounding boxes — drawing rectangles around objects in images
Segmentation — labelling every pixel in an image (this pixel is road, this pixel is car)
Named entity tagging — marking words in text as people, organisations, locations, dates
Transcription — converting speech to text with timestamps and speaker identification
Ranking — ordering items by quality or relevance (used in RLHF for language model training)
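
The task types above differ mainly in the shape of the label attached to each item. The sketch below shows three of them as Python dataclasses. The class names, field names, and the example sentence are hypothetical, chosen for illustration rather than taken from any standard annotation schema.

```python
from dataclasses import dataclass

@dataclass
class ClassificationLabel:
    item_id: str
    category: str      # one category for the whole item

@dataclass
class BoundingBox:
    image_id: str
    category: str
    x: int             # left edge, in pixels
    y: int             # top edge, in pixels
    width: int
    height: int

    def area(self) -> int:
        return self.width * self.height

@dataclass
class EntitySpan:
    text_id: str
    start: int         # character offset, inclusive
    end: int           # character offset, exclusive
    entity_type: str

    def surface(self, text: str) -> str:
        """Return the piece of text that this span marks."""
        return text[self.start:self.end]

# Invented example sentence with three tagged entities.
sentence = "Alice joined Acme in Paris."
spans = [
    EntitySpan("doc-1", 0, 5, "PERSON"),
    EntitySpan("doc-1", 13, 17, "ORGANISATION"),
    EntitySpan("doc-1", 21, 26, "LOCATION"),
]
review = ClassificationLabel("review-7", "negative")
box = BoundingBox("img-3", "car", x=40, y=60, width=120, height=80)

for span in spans:
    print(span.entity_type, span.surface(sentence))
```

Segmentation, transcription, and ranking would need their own shapes (a per-pixel mask, timestamped text segments, and an ordered list of item IDs, respectively).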

Who does the labelling

In-house teams — domain experts who understand the nuances of your data
Crowdsourcing platforms — services like Amazon Mechanical Turk for large-scale, simpler labelling tasks
Specialised vendors — companies like Scale AI, Labelbox, or Appen that provide trained labelling workforces
AI-assisted labelling — using models to generate initial labels that humans review and correct, dramatically speeding up the process
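
One common way to organise AI-assisted labelling is to let the model propose a label with a confidence score, then send low-confidence items for full human review and high-confidence items for a lighter spot check. The sketch below assumes that pattern. The `route_for_review` function, the `toy_model` stand-in, and the 0.9 threshold are illustrative assumptions, not a prescribed workflow.

```python
from typing import Callable, List, Tuple

Prediction = Tuple[str, float]  # (proposed label, confidence between 0 and 1)

def route_for_review(
    items: List[str],
    model: Callable[[str], Prediction],
    threshold: float = 0.9,  # assumed cut-off; tune it against audited samples
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split model pre-labels into a spot-check queue and a full-review queue."""
    spot_check, full_review = [], []
    for item in items:
        label, confidence = model(item)
        queue = spot_check if confidence >= threshold else full_review
        queue.append((item, label))
    return spot_check, full_review

def toy_model(email: str) -> Prediction:
    # Hypothetical stand-in for a trained classifier.
    if "free money" in email.lower():
        return ("spam", 0.97)
    return ("not spam", 0.60)

emails = ["FREE MONEY inside!!!", "Minutes from Tuesday's meeting"]
spot_check, full_review = route_for_review(emails, toy_model)
print("spot check:", spot_check)
print("full review:", full_review)
```

Humans still see every item in this sketch. The threshold only decides how much attention each pre-label gets.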

Quality challenges

Ambiguity — reasonable people may disagree on the correct label. Clear labelling guidelines and inter-annotator agreement metrics are essential.
Fatigue — repetitive labelling leads to errors. Rotation and quality checks help.
Bias — labellers bring their own perspectives, which can introduce systematic bias into training data.
Cost — labelling is often the most expensive part of an AI project, particularly for specialised domains requiring expert annotators.
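
Cohen's kappa is a widely used inter-annotator agreement metric for two annotators: it compares how often they agree with how often they would agree by chance, given each annotator's label frequencies. The sketch below is a plain-Python version. The function name and the ten example labels are assumptions for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same category at random,
    # given how often each annotator used each category.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)

# Invented labels: the annotators agree on 8 of 10 emails.
annotator_a = ["spam"] * 5 + ["not spam"] * 5
annotator_b = ["spam"] * 4 + ["not spam", "spam"] + ["not spam"] * 4
print(round(cohens_kappa(annotator_a, annotator_b), 2))
```

Here raw agreement is 80%, but because half of that could arise by chance, kappa comes out at about 0.6. A value of 1 means perfect agreement and 0 means agreement no better than chance. Low values usually point to ambiguous guidelines rather than careless labellers.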

Why This Matters

Data labelling is a hidden cost that catches many AI projects off guard. Understanding it helps you budget realistically, plan timelines accurately, and recognise that the quality of your labels directly limits the quality of your model. Poor labels in, poor predictions out.

Related Terms

Training Data

The dataset used to teach an AI model. The quality, size, and composition of training data directly determine what the AI can and cannot do well.

Annotation

The process of adding labels or tags to raw data so AI models can learn from it during training.

Supervised Learning

A machine learning approach where the model learns from labelled examples — input data paired with correct answers. The most common type of machine learning in business applications.

Machine Learning (ML)

A type of AI where systems learn patterns from data instead of following explicitly programmed rules. The system improves its performance through experience.
