What is categorical data?

Categorical data is information that falls into distinct groups or labels rather than measuring quantities. It describes what something is, not how much of it there is. Someone's preferred mode of transport (car, bicycle, public transit), their favorite coffee order, or whether an email is spam are all categorical. The values represent different buckets, not points on a scale.

This is what separates it from numerical data, where the values carry real mathematical weight. A 200-square-meter house is genuinely twice as large as a 100-square-meter one. But a postal code of 20004 isn't “twice as significant” as 10002. They're just two different places.

Categorical data comes in two forms.

Nominal: Categories with no natural order, like hair color or country of birth.
Ordinal: Categories that follow a clear sequence, like clothing sizes (small, medium, large) or satisfaction ratings (poor, average, good), but where the gaps between values aren't precisely measurable.
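
The difference between the two forms can be sketched in a few lines of plain Python. This is only an illustration: the names HAIR_COLORS, CLOTHING_SIZES, and is_larger are made up for this example, and the specific hair colors listed are assumed values. A set stands in for nominal categories, where only membership and equality make sense; an ordered list stands in for ordinal categories, where position gives a rank.

```python
# Nominal: no natural order, so a set is enough.
# Only equality and membership are meaningful. (Example values are assumed.)
HAIR_COLORS = {"black", "brown", "blonde", "red"}

# Ordinal: position in the list is the rank, from lowest to highest.
CLOTHING_SIZES = ["small", "medium", "large"]


def is_larger(a: str, b: str) -> bool:
    """Return True if size `a` ranks above size `b` (hypothetical helper)."""
    return CLOTHING_SIZES.index(a) > CLOTHING_SIZES.index(b)


# Membership is the only sensible question for a nominal value.
print("brown" in HAIR_COLORS)

# Ordinal values can be compared and sorted by rank.
print(is_larger("large", "small"))
print(sorted(["large", "small", "medium"], key=CLOTHING_SIZES.index))
```

The rank says that large comes after medium, but not by how much, which is why subtracting or averaging ranks is generally not meaningful.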

In machine learning, categorical data usually has to be converted into numbers before training, since most models operate only on numerical inputs.
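
As a rough sketch of what that conversion can look like, the snippet below shows two common approaches: one-hot encoding, often used for nominal values, and integer rank encoding, often used for ordinal values. The source does not name a particular method, so both are assumptions chosen for illustration, and the function names one_hot and ordinal_encode are hypothetical helpers rather than any library's API.

```python
from typing import List


def one_hot(value: str, categories: List[str]) -> List[int]:
    """Encode a nominal value as a vector with a single 1 (illustrative helper)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def ordinal_encode(value: str, ordered_categories: List[str]) -> int:
    """Encode an ordinal value as its rank, starting at 0 (illustrative helper)."""
    return ordered_categories.index(value)


# Nominal example from the text: mode of transport.
TRANSPORT = ["car", "bicycle", "public transit"]

# Ordinal example from the text: satisfaction ratings, lowest to highest.
SATISFACTION = ["poor", "average", "good"]

print(one_hot("bicycle", TRANSPORT))
print(ordinal_encode("good", SATISFACTION))
```

One-hot encoding avoids implying an order that a nominal variable does not have, while rank encoding keeps the sequence of an ordinal variable at the cost of suggesting evenly spaced gaps.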