Categorical Data
Data that represents distinct groups or categories — like colours, countries, or product types — rather than continuous numerical values.
Categorical data is data that falls into distinct groups or categories rather than being measured on a numerical scale. Understanding categorical data matters because AI models handle it differently from numerical data, and a poor choice of encoding can quietly degrade a model's accuracy.
Examples of categorical data
- Customer segments: "Enterprise," "SMB," "Individual"
- Product types: "Software," "Hardware," "Services"
- Regions: "North America," "Europe," "Asia-Pacific"
- Sentiment: "Positive," "Negative," "Neutral"
- Binary outcomes: "Yes/No," "Spam/Not spam," "Approved/Denied"
Nominal vs. ordinal
Categorical data comes in two flavours:
- Nominal — categories with no natural order. "Red," "Blue," "Green" are nominal. There is no sense in which red is greater than blue.
- Ordinal — categories with a natural ranking. "Low," "Medium," "High" are ordinal. The order matters, but the gaps between categories may not be equal.
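The distinction can be sketched in a few lines of Python. This is only an illustration: the names PRIORITY_RANK and is_higher and the sample values are assumptions made for the example, not part of any library.

```python
from collections import Counter

# Ordinal: the ranking is part of the data, so we record it explicitly.
# (PRIORITY_RANK is a hypothetical mapping chosen for this sketch.)
PRIORITY_RANK = {"Low": 0, "Medium": 1, "High": 2}

def is_higher(a: str, b: str) -> bool:
    """Return True if ordinal level `a` ranks above level `b`."""
    return PRIORITY_RANK[a] > PRIORITY_RANK[b]

# Nominal: there is no ranking, so only equality checks and counts are meaningful.
colours = ["Red", "Blue", "Red", "Green"]
colour_counts = Counter(colours)
most_common_colour = colour_counts.most_common(1)[0][0]  # the mode

print(is_higher("High", "Low"))
print(most_common_colour)
```

Comparing "High" with "Low" makes sense because the ranks carry meaning. For colours, the only sensible summary is a count or the most frequent value.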
How AI handles categorical data
Machine learning models work with numbers, not labels. Categorical data must be converted through encoding:
- One-hot encoding creates a binary column for each category. A "colour" column with three values becomes three columns: is_red, is_blue, is_green.
- Label encoding assigns an integer to each category (red=0, blue=1, green=2). This suits ordinal data when the integers follow the natural ranking (low=0, medium=1, high=2), but it can mislead models on nominal data such as colours by implying an order that does not exist.
- Embedding learns a dense numerical vector for each category during training. This is how LLMs handle text: each token (a word or piece of a word) in the vocabulary is a category mapped to a vector.
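The three encodings above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how any particular library implements them: the helper one_hot, the sample data, and the two-dimensional embedding size are all chosen for the example, and the embedding vectors are only randomly initialised here, whereas a real model would adjust them during training.

```python
import random

colours = ["red", "blue", "green", "blue"]   # one categorical column
categories = ["red", "blue", "green"]        # fixed category order

# One-hot encoding: one binary column per category (is_red, is_blue, is_green).
def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

one_hot_rows = [one_hot(c, categories) for c in colours]

# Label encoding: one integer per category (red=0, blue=1, green=2).
label_of = {c: i for i, c in enumerate(categories)}
labels = [label_of[c] for c in colours]

# Embedding: a lookup table from category to a dense vector.
# The vectors start random; training would update them (not shown here).
rng = random.Random(0)
embedding_dim = 2  # assumed size, for illustration only
embedding_table = {
    c: [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for c in categories
}
embedded = [embedding_table[c] for c in colours]

print(one_hot_rows[0])  # the row for "red"
print(labels)
```

Note that the label-encoded column makes green (2) look "larger" than red (0), which is exactly the false ordering the one-hot version avoids.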
Common pitfalls
- Using label encoding for nominal data (the model thinks category 3 is "bigger" than category 1)
- Having too many categories (high cardinality), which creates sparse, unwieldy one-hot encodings
- Ignoring rare categories that appear only a few times in training data, leaving the model too few examples to learn anything reliable about them
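One common way to ease both the high-cardinality and the rare-category problem is to fold infrequent values into a single catch-all bucket before encoding. The sketch below assumes a hypothetical helper named group_rare, a threshold of two occurrences, and made-up sample data; a suitable threshold depends on the dataset.

```python
from collections import Counter

def group_rare(values, min_count=2, other="Other"):
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

regions = [
    "Europe", "Europe", "Europe",
    "North America", "North America",
    "Asia-Pacific", "Asia-Pacific",
    "Antarctica", "Oceania",  # each appears only once
]

grouped = group_rare(regions, min_count=2)

print(len(set(regions)), "categories before grouping")
print(len(set(grouped)), "categories after grouping")
```

Fewer distinct categories means fewer one-hot columns, and the "Other" bucket gathers enough rows for the model to learn from, at the cost of no longer telling the rare categories apart.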
Why This Matters
Much business data is categorical: customer segments, product lines, regions, survey responses. When your data team prepares data for AI models, how they encode categorical variables directly affects model performance. Understanding the basics helps you ask informed questions about data preparation choices.