Core AI

Categorical Data

Last reviewed: April 2026

Data that represents distinct groups or categories — like colours, countries, or product types — rather than continuous numerical values.

Categorical data falls into distinct groups or categories rather than being measured on a numerical scale. This matters because AI models handle it differently from numerical data: categories must be converted to numbers before a model can use them, and a poorly chosen conversion can introduce patterns that are not really in the data.

Examples of categorical data

Customer segments: "Enterprise," "SMB," "Individual"
Product types: "Software," "Hardware," "Services"
Regions: "North America," "Europe," "Asia-Pacific"
Sentiment: "Positive," "Negative," "Neutral"
Binary outcomes: "Yes/No," "Spam/Not spam," "Approved/Denied"

Nominal vs. ordinal

Categorical data comes in two flavours:

Nominal — categories with no natural order. "Red," "Blue," "Green" are nominal. There is no sense in which red is greater than blue.
Ordinal — categories with a natural ranking. "Low," "Medium," "High" are ordinal. The order matters, but the gaps between categories may not be equal.
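
The difference shows up as soon as you try to sort or compare values. The sketch below is illustrative only: the ticket priorities and colour values are made-up sample data, and the names PRIORITY_ORDER and RANK are hypothetical.

```python
from collections import Counter

# Ordinal: the order of this list defines the ranking (hypothetical scale).
PRIORITY_ORDER = ["Low", "Medium", "High"]
RANK = {label: position for position, label in enumerate(PRIORITY_ORDER)}

tickets = ["High", "Low", "Medium", "Low"]

# Sorting by rank respects the natural order of the categories.
sorted_by_rank = sorted(tickets, key=RANK.__getitem__)

# Plain alphabetical sorting ignores it ("High" comes before "Low").
sorted_alpha = sorted(tickets)

# Nominal: there is no ranking, so counting and equality checks
# are the only meaningful operations.
colours = ["Red", "Blue", "Red", "Green"]
colour_counts = Counter(colours)

print(sorted_by_rank)   # ['Low', 'Low', 'Medium', 'High']
print(sorted_alpha)     # ['High', 'Low', 'Low', 'Medium']
print(colour_counts["Red"])  # 2
```

Note that the ranks 0, 1, 2 only record order. They do not claim that the gap between "Low" and "Medium" equals the gap between "Medium" and "High".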

How AI handles categorical data

Machine learning models work with numbers, not labels. Categorical data must be converted through encoding:

One-hot encoding creates a binary column for each category. A "colour" column with three values becomes three columns: is_red, is_blue, is_green.
Label encoding assigns an integer to each category (red=0, blue=1, green=2). This suits ordinal data when the integers follow the category ranking (low=0, medium=1, high=2), but with nominal data such as colours it can mislead models by implying an order that does not exist.
Embedding learns a dense numerical representation of each category during training. This is how LLMs handle text: each token (a word or piece of a word) is a category mapped to a vector.
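
The three encodings above can be sketched in plain Python. This is a minimal illustration, not how a production library implements them: the function names one_hot, ordinal_encode, and init_embeddings are hypothetical, and the embedding table is only initialised with random values (a real model would adjust those vectors during training).

```python
import random

def one_hot(values, categories):
    """Return one row per value, with a 0/1 flag for each category."""
    return [[1 if value == cat else 0 for cat in categories] for value in values]

def ordinal_encode(values, ordered_categories):
    """Map each value to its position in an explicitly ordered list."""
    rank = {cat: position for position, cat in enumerate(ordered_categories)}
    return [rank[value] for value in values]

def init_embeddings(categories, dim, seed=0):
    """Build a lookup table with one small random vector per category.

    Assumption: this shows only the starting point. Training, which is
    what makes embeddings useful, is not shown here.
    """
    rng = random.Random(seed)
    return {cat: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for cat in categories}

# One-hot: columns are is_red, is_blue, is_green (in that order).
colours = ["red", "green", "blue", "red"]
colour_categories = ["red", "blue", "green"]
colour_one_hot = one_hot(colours, colour_categories)
print(colour_one_hot)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]

# Label encoding that follows the ranking of an ordinal variable.
levels = ["Medium", "Low", "High"]
level_codes = ordinal_encode(levels, ["Low", "Medium", "High"])
print(level_codes)  # [1, 0, 2]

# Embedding lookup: each category maps to a dense vector of length 4.
embeddings = init_embeddings(colour_categories, dim=4)
colour_vectors = [embeddings[colour] for colour in colours]
```

One-hot output grows by one column per category, while an embedding keeps a fixed vector length however many categories there are. That is one reason embeddings are often preferred when the number of categories is large.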

Common pitfalls

Using label encoding for nominal data (the model thinks category 3 is "bigger" than category 1)
Having too many categories (high cardinality), which creates sparse, unwieldy one-hot encodings
Ignoring rare categories that appear only a few times in training data
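
One common way to reduce both the high-cardinality and rare-category problems is to merge infrequent values into a single catch-all category before encoding. The sketch below assumes a hypothetical helper, group_rare, and an arbitrary threshold of two occurrences. The right threshold depends on your data, and this is one option among several rather than a recommended default.

```python
from collections import Counter

def group_rare(values, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else other_label for value in values]

# Made-up sample data: two countries appear only once.
countries = ["US", "US", "UK", "US", "UK", "Fiji", "Malta"]
grouped = group_rare(countries, min_count=2)

print(grouped)  # ['US', 'US', 'UK', 'US', 'UK', 'Other', 'Other']
print(len(set(countries)), "->", len(set(grouped)))  # 4 -> 3
```

After grouping, a one-hot encoding needs three columns instead of four, and the model sees enough examples of "Other" to learn something from it.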

Why This Matters

Much business data is categorical: customer segments, product lines, regions, survey responses. When your data team prepares data for AI models, how they encode categorical variables directly affects model performance. Understanding the basics helps you ask informed questions about data preparation choices.

Related Terms

Machine Learning (ML)

A type of AI where systems learn patterns from data instead of following explicitly programmed rules. The system improves its performance through experience.

Training Data

The dataset used to teach an AI model. The quality, size, and composition of training data directly determine what the AI can and cannot do well.

Artificial Intelligence (AI)

Software that can perform tasks that normally require human intelligence, such as understanding language, recognising patterns, and making decisions.

