Data Labelling
The process of tagging raw data with informative labels so supervised machine learning models can learn the relationship between inputs and desired outputs.
Data labelling is the process of assigning meaningful tags or annotations to raw data (text, images, audio, video) so that machine learning models can learn from it. It is the human work that makes supervised learning possible.
Why labelling matters
Supervised machine learning requires examples of inputs paired with correct outputs. A spam filter needs thousands of emails labelled "spam" or "not spam." An image classifier needs pictures labelled with their contents. Without labels, the model has no ground truth to learn from.
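As a minimal sketch of what "inputs paired with correct outputs" can look like, the snippet below stores a labelled set as (text, label) pairs and fits a deliberately naive word-count classifier to it. The example emails, the `train` and `predict` names, and the scoring rule are illustrative assumptions, not how a production spam filter works.

```python
from collections import Counter, defaultdict

# Hypothetical labelled examples: each input is paired with its correct output.
labelled_emails = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("please review the project agenda", "not spam"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.split())
    return word_counts


def predict(word_counts, text):
    """Pick the label whose training words overlap most with the text."""
    scores = {
        label: sum(counts[word] for word in text.split())
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)


model = train(labelled_emails)
prediction = predict(model, "claim your free prize")
```

Without the second element of each pair, `train` would have nothing to group the words by, which is the sense in which labels provide the ground truth.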
Types of labelling tasks
- Classification labels: assigning a category to an entire item (this email is spam, this review is negative)
- Bounding boxes: drawing rectangles around objects in images
- Segmentation: labelling every pixel in an image (this pixel is road, this pixel is car)
- Named entity tagging: marking words in text as people, organisations, locations, dates
- Transcription: converting speech to text with timestamps and speaker identification
- Ranking: ordering items by quality or relevance (used in RLHF for language model training)
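The task types above differ mainly in the shape of the label that gets stored. The sketch below shows one possible in-memory representation for several of them. All class names, field names, and sample values are assumptions made for illustration; real labelling tools define their own export formats.

```python
from dataclasses import dataclass


@dataclass
class ClassificationLabel:
    item_id: str
    category: str  # one category for the whole item


@dataclass
class BoundingBox:
    image_id: str
    category: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        """Area of the rectangle in pixels."""
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


@dataclass
class EntitySpan:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    entity_type: str


review = ClassificationLabel(item_id="review-17", category="negative")
car = BoundingBox("img-001", "car", x_min=10, y_min=20, x_max=110, y_max=70)

# Segmentation: one class per pixel, shown here for a tiny 2x3 image.
mask = [
    ["road", "road", "car"],
    ["road", "car", "car"],
]

# Named entity tagging: spans index into the original text.
sentence = "Ada Lovelace was born in London"
entities = [EntitySpan(0, 12, "PERSON"), EntitySpan(25, 31, "LOCATION")]
tagged = [(sentence[e.start:e.end], e.entity_type) for e in entities]

# Ranking: candidate ids ordered from best to worst by a labeller.
ranking = ["response-b", "response-a", "response-c"]
```

Storing entity spans as character offsets rather than copied strings keeps the label tied to an exact position in the text, which matters when the same word appears more than once.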
Who does the labelling
- In-house teams: domain experts who understand the nuances of your data
- Crowdsourcing platforms: services like Amazon Mechanical Turk for large-scale, simpler labelling tasks
- Specialised vendors: companies like Scale AI, Labelbox, or Appen that provide trained labelling workforces
- AI-assisted labelling: using models to generate initial labels that humans review and correct, dramatically speeding up the process
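The AI-assisted approach in the last item can be sketched as a two-step loop: a model proposes labels, then human corrections override them. The code below is a rough illustration under assumed data. The item ids, confidence scores, and the idea of showing reviewers the least confident suggestions first are assumptions, not a description of any particular vendor's workflow.

```python
# Hypothetical model output: item id -> (suggested label, confidence from 0 to 1).
pre_labels = {
    "img-001": ("cat", 0.98),
    "img-002": ("dog", 0.55),
    "img-003": ("cat", 0.71),
}


def review_queue(pre_labels):
    """Order items so reviewers see the least confident suggestions first."""
    return sorted(pre_labels, key=lambda item_id: pre_labels[item_id][1])


def apply_review(pre_labels, corrections):
    """Merge model suggestions with human corrections; the human label wins."""
    if not pre_labels:
        return {}, 0.0
    final = {
        item_id: corrections.get(item_id, suggested)
        for item_id, (suggested, _confidence) in pre_labels.items()
    }
    changed = sum(
        1 for item_id in final if final[item_id] != pre_labels[item_id][0]
    )
    # The correction rate is a rough signal of how far the model can be trusted.
    return final, changed / len(final)


queue = review_queue(pre_labels)
# Assumed reviewer outcome: only img-002 was mislabelled by the model.
final_labels, correction_rate = apply_review(pre_labels, {"img-002": "cat"})
```

Tracking the correction rate over time gives a simple check on whether the pre-labelling model is saving reviewer effort or merely adding a step.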
Quality challenges
- Ambiguity: reasonable people may disagree on the correct label. Clear labelling guidelines and inter-annotator agreement metrics are essential.
- Fatigue: repetitive labelling leads to errors. Rotation and quality checks help.
- Bias: labellers bring their own perspectives, which can introduce systematic bias into training data.
- Cost: labelling is often the most expensive part of an AI project, particularly for specialised domains requiring expert annotators.
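One widely used inter-annotator agreement metric for two labellers is Cohen's kappa, which compares observed agreement with the agreement expected by chance given each labeller's label frequencies. The sketch below computes it from scratch. The two label lists are invented for illustration, and the function name `cohens_kappa` is simply a descriptive choice.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same category independently.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same six emails.
annotator_a = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]
annotator_b = ["spam", "not spam", "not spam", "not spam", "spam", "not spam"]

kappa = cohens_kappa(annotator_a, annotator_b)
```

Here the annotators agree on five of six items (about 0.83), chance agreement is 0.5, and kappa works out to about 0.67. A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so a low kappa is usually a prompt to revisit the labelling guidelines.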
Why This Matters
Data labelling is the hidden cost that catches many AI projects off guard. Understanding it helps you budget realistically, plan timelines accurately, and recognise that the quality of your labels directly determines the quality of your model. Poor labels in means poor predictions out.