Categorical Data

A categorical column is a pandas dtype that stores each repeated text value once and replaces every cell with a tiny integer code, saving memory and speeding up grouping.


Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

Learn to convert columns with astype("category"), measure the memory savings, build ordered categories that sort and compare correctly, and use the .cat accessor.

A column like status that only ever holds "active", "paused", or "closed" wastes space as an object column, because it stores the full word in every single row. Convert it with df["status"].astype("category") and pandas keeps just one copy of each label, replacing every cell with a small integer code that points back at the label.
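A minimal sketch of the conversion and the memory check follows. The data is made up for illustration (30,000 rows built from three labels), and the column is created with dtype=object so the comparison matches the object column described above. Exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical data: 30,000 rows drawn from only three labels.
df = pd.DataFrame(
    {"status": ["active", "paused", "closed"] * 10_000},
    dtype=object,
)

# deep=True counts the bytes of the strings themselves, not just the pointers.
object_bytes = df["status"].memory_usage(deep=True)

# Convert: each label is stored once, each cell becomes a small integer code.
df["status"] = df["status"].astype("category")
category_bytes = df["status"].memory_usage(deep=True)

print(df["status"].dtype)  # category
print(f"object: {object_bytes:,} bytes, category: {category_bytes:,} bytes")
```

With few distinct labels repeated across many rows, the category figure should come out far smaller than the object figure.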

Categorical columns get a dedicated namespace, .cat, just like text columns get .str and dates get .dt. Through it you read and reshape the category set: .cat.categories lists the labels, .cat.codes shows the integers, and methods such as .cat.add_categories() and .cat.remove_unused_categories() let you tidy the label list independently of the data.
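The accessor calls named above can be tried on a small Series. This is an illustrative sketch; the sample values and the "archived" label are made up for the example.

```python
import pandas as pd

s = pd.Series(["active", "paused", "active", "closed"], dtype="category")

# The label set and the integer code stored for each row.
print(list(s.cat.categories))  # ['active', 'closed', 'paused']
print(s.cat.codes.tolist())    # [0, 2, 0, 1]

# Register a label that no row uses yet; the data itself is unchanged.
s = s.cat.add_categories(["archived"])
print(list(s.cat.categories))  # ['active', 'closed', 'paused', 'archived']

# Drop labels that no row refers to.
s = s.cat.remove_unused_categories()
print(list(s.cat.categories))  # ['active', 'closed', 'paused']
```

Both methods return a new Series rather than modifying the original, which is why the result is assigned back to s.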

Some categories have a natural ranking that the alphabet gets wrong: sorting ["Low", "Medium", "High"] as plain text gives High, Low, Medium, which ignores the real ranking. An ordered category teaches pandas the real sequence. Then comparisons like df["size"] > "Low" mean what you expect, and sort_values follows your order rather than the alphabet.
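One way to declare that sequence is pd.CategoricalDtype with ordered=True, sketched below on a made-up size column.

```python
import pandas as pd

# Declare the rank explicitly: Low < Medium < High.
size_type = pd.CategoricalDtype(categories=["Low", "Medium", "High"], ordered=True)

df = pd.DataFrame({"size": ["High", "Low", "Medium", "High"]})
df["size"] = df["size"].astype(size_type)

# Comparisons use the declared rank, not the alphabet.
print((df["size"] > "Low").tolist())     # [True, False, True, True]

# Sorting follows the declared rank as well.
print(df["size"].sort_values().tolist())  # ['Low', 'Medium', 'High', 'High']
```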

Setting a brand-new label on a category column produces NaN or an error. Add the label to the category set first with .cat.add_categories(), then assign it.
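The sketch below shows the error path and one fix. It assumes a recent pandas release, where direct assignment of an unknown label raises TypeError. ValueError is caught as well in case a different version behaves differently. The "archived" label is made up for the example.

```python
import pandas as pd

s = pd.Series(["active", "paused", "closed"], dtype="category")

error_name = None
try:
    s.iloc[0] = "archived"  # "archived" is not in the category set
except (TypeError, ValueError) as exc:
    error_name = type(exc).__name__
    print(error_name, exc)

# Fix: register the label first, then assign it.
s = s.cat.add_categories(["archived"])
s.iloc[0] = "archived"
print(s.tolist())  # ['archived', 'paused', 'closed']
```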

❌ Sorting an unordered category and expecting rank order
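A short illustration of this mistake, using made-up values: astype("category") alone infers the labels in sorted order and leaves the category unordered, so sort_values follows that alphabetical label list. One possible fix is .cat.set_categories() with ordered=True.

```python
import pandas as pd

sizes = pd.Series(["Low", "Medium", "High", "Low"]).astype("category")

# Unordered: the inferred label list is alphabetical, and sorting follows it.
print(sizes.cat.ordered)               # False
print(sizes.sort_values().tolist())    # ['High', 'Low', 'Low', 'Medium']

# Declare the real rank, then sort again.
ranked = sizes.cat.set_categories(["Low", "Medium", "High"], ordered=True)
print(ranked.sort_values().tolist())   # ['Low', 'Low', 'Medium', 'High']
```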

Turn free-text ratings into an ordered category and sort respondents.

Lesson complete — your repeated text is now cheap!

You can convert with astype("category"), measure the memory win, drive the .cat accessor, and build ordered categories that compare and sort by real rank.


Practice quiz

When does the category dtype give the biggest benefit?

When almost every value is unique
When the column holds dates
When a column has many rows but few distinct repeating values
When there is only one row

Answer: When a column has many rows but few distinct repeating values. Category shines when few distinct values repeat across many rows, like status or country.

How does category save memory?

It stores each label once and replaces cells with small integer codes
It deletes rows
It compresses the file on disk
It rounds numbers

Answer: It stores each label once and replaces cells with small integer codes. Each distinct value is kept once in a lookup table; every cell becomes a tiny code pointing at it.

Which converts a column to the category dtype?

Answer: astype('category'). It converts the column to the category dtype.

What does the .cat accessor provide?

String methods
Category-specific tools like .cat.categories and .cat.codes
Date parts
Plotting helpers

Answer: Category-specific tools like .cat.categories and .cat.codes. The .cat accessor exposes categories, codes, and methods to manage the category set.

For pd.Series(['S','M','L','M','S']).astype('category'), what is .cat.categories?

Answer: ['L', 'M', 'S']. Categories are stored sorted, so 'L' comes first even though 'S' appears first in the data.

For that same Series ['S','M','L','M','S'], what are the .cat.codes?

Answer: [2, 1, 0, 1, 2]. With categories ['L','M','S'] (codes L=0, M=1, S=2), the values S, M, L, M, S map to 2, 1, 0, 1, 2.
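Both of these answers can be checked directly with a couple of lines:

```python
import pandas as pd

s = pd.Series(["S", "M", "L", "M", "S"]).astype("category")

print(list(s.cat.categories))  # ['L', 'M', 'S']
print(s.cat.codes.tolist())    # [2, 1, 0, 1, 2]
```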

What does an ordered category add over an unordered one?

Faster grouping only
A meaningful sequence so comparisons and sorting follow your order
Automatic NaN filling
Extra columns

Answer: A meaningful sequence so comparisons and sorting follow your order. An ordered category records a rank, so Low < Medium < High compares and sorts correctly.

With an ordered category Low<Medium<High applied to ['High','Low','Medium','High'], what is sort_values()?

High, Low, Medium, High

Answer: Low, Medium, High, High. Sorting follows the declared order, not the alphabet.

What happens if you assign a value that is not an existing category?

It is added automatically
It raises an error or becomes NaN
It is ignored
It converts the column to object

Answer: It raises an error or becomes NaN. Unknown labels have nowhere to go; add them first with .cat.add_categories().

How do you create an ordered categorical type?

astype('ordered')
pd.sort_categories()

Answer: CategoricalDtype(categories, ordered=True). It declares the ordered type, which you then pass to astype().