Categorical Data
The category dtype in pandas stores each distinct value once and replaces every cell with a small integer code, which saves memory and can speed up grouping when a column has few distinct values.
Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
Learn to convert columns with astype("category"), measure the memory savings, build ordered categories that sort and compare correctly, and use the .cat accessor.
A column like status that only ever holds "active", "paused", or "closed" wastes space as an object column: it stores the full word in every single row. Convert it with df["status"].astype("category") and pandas keeps just one copy of each label, replacing every cell with a small integer code that points back at the label.
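To see the saving for yourself, here is a minimal sketch that compares the same column stored both ways. The 300,000-row frame is made-up illustration data, and the exact byte counts will vary with your pandas version, so the code prints them rather than promising a figure.

```python
import pandas as pd

# Hypothetical data: 300,000 rows but only three distinct labels.
df = pd.DataFrame(
    {"status": ["active", "paused", "closed"] * 100_000},
    dtype="object",  # force the plain object storage described above
)

# deep=True counts the Python strings themselves, not just the pointers.
as_object = df["status"].memory_usage(deep=True)
as_category = df["status"].astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
print(f"saving:   about {as_object / as_category:.0f}x smaller")
```

The saving shrinks as the number of distinct values grows, so it is worth measuring on your own data before converting.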
Categorical columns get a dedicated namespace, .cat, just like text columns get .str and dates get .dt. Through it you read and reshape the category set: .cat.categories lists the labels, .cat.codes shows the integers, and methods such as .cat.add_categories() and .cat.remove_unused_categories() let you tidy the label list independently of the data.
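A short sketch of the accessor in action, using a small made-up status Series. The names with_extra and tidied are just illustrative variables; each .cat method returns a new Series rather than changing the original.

```python
import pandas as pd

status = pd.Series(["active", "paused", "closed", "active"]).astype("category")

# The label list (inferred labels are kept in sorted order).
print(list(status.cat.categories))  # ['active', 'closed', 'paused']

# The integer code stored for each row.
print(status.cat.codes.tolist())    # [0, 2, 1, 0]

# Register a label before any row uses it.
with_extra = status.cat.add_categories(["archived"])
print(list(with_extra.cat.categories))  # ['active', 'closed', 'paused', 'archived']

# Drop labels that no row refers to ("archived" disappears again).
tidied = with_extra.cat.remove_unused_categories()
print(list(tidied.cat.categories))  # ['active', 'closed', 'paused']
```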
Some categories have a natural ranking that the alphabet gets wrong: sorting ["Low", "Medium", "High"] as plain text gives High, Low, Medium, which is nonsense. An ordered category teaches pandas the real sequence. Then comparisons like df["size"] > "Low" mean what you expect, and sort_values follows your order rather than the alphabet.
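Here is one way to declare that ranking, sketched with a tiny made-up size column. CategoricalDtype with ordered=True is the piece that carries the sequence; the variable names are only for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# List the labels from lowest to highest and mark the type as ordered.
levels = CategoricalDtype(categories=["Low", "Medium", "High"], ordered=True)

df = pd.DataFrame({"size": ["High", "Low", "Medium", "High"]})
df["size"] = df["size"].astype(levels)

# Comparisons now use the declared rank, not the alphabet.
above_low = df["size"] > "Low"
print(above_low.tolist())     # [True, False, True, True]

# Sorting follows the same rank.
sorted_sizes = df["size"].sort_values()
print(sorted_sizes.tolist())  # ['Low', 'Medium', 'High', 'High']
```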
Setting a brand-new label on a category column produces NaN or an error. Add the label to the category set first with .cat.add_categories().
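The sketch below shows both outcomes on made-up data. The exception type raised by direct assignment has differed between pandas versions, so the example catches both TypeError and ValueError instead of assuming one.

```python
import pandas as pd

status = pd.Series(["active", "paused", "closed"]).astype("category")

# 1) Direct assignment of an unknown label is rejected.
rejected = False
try:
    status.iloc[0] = "archived"
except (TypeError, ValueError) as exc:
    rejected = True
    print(f"rejected: {exc}")

# 2) Building against a fixed label list turns unknown labels into NaN.
fixed = pd.Categorical(
    ["active", "archived"],
    categories=["active", "paused", "closed"],
)
print(fixed.isna().tolist())  # [False, True]

# 3) Register the label first, and the same assignment goes through.
status = status.cat.add_categories(["archived"])
status.iloc[0] = "archived"
print(status.tolist())  # ['archived', 'paused', 'closed']
```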
❌ Sorting an unordered category and expecting rank order
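A small sketch of this mistake and one possible fix, on made-up priority data. Without a declared order, the inferred labels are alphabetical, so the sort is alphabetical too; .cat.reorder_categories with ordered=True is one way to supply the real ranking afterwards.

```python
import pandas as pd

priority = pd.Series(["High", "Low", "Medium", "High"]).astype("category")

# Unordered: the inferred labels are alphabetical, and sorting follows them.
unranked_sort = priority.sort_values().tolist()
print(unranked_sort)  # ['High', 'High', 'Low', 'Medium']

# Declare the real ranking, then sort again.
ranked = priority.cat.reorder_categories(["Low", "Medium", "High"], ordered=True)
ranked_sort = ranked.sort_values().tolist()
print(ranked_sort)    # ['Low', 'Medium', 'High', 'High']
```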
Turn free-text ratings into an ordered category and sort respondents.
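If you want a starting point, here is one possible approach rather than the official solution. The survey frame, the respondent names, and the four-step scale are all invented for illustration. The key step is cleaning the text before converting, because any answer that does not exactly match a declared label would become NaN.

```python
import pandas as pd

# Hypothetical survey answers with inconsistent case and spacing.
survey = pd.DataFrame({
    "respondent": ["Ana", "Ben", "Cho", "Dev"],
    "rating": [" good", "Poor", "EXCELLENT", "fair "],
})

# Assumed scale, listed from worst to best.
scale = pd.CategoricalDtype(["Poor", "Fair", "Good", "Excellent"], ordered=True)

# Normalise the text first so every answer matches a declared label.
cleaned = survey["rating"].str.strip().str.title()
survey["rating"] = cleaned.astype(scale)

# Best rating first.
ranked = survey.sort_values("rating", ascending=False)
print(ranked["respondent"].tolist())  # ['Cho', 'Ana', 'Dev', 'Ben']
```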
Lesson complete — your repeated text is now cheap!
You can convert with astype("category"), measure the memory win, drive the .cat accessor, and build ordered categories that compare and sort by real rank.
🚀 Up next: Dropping & Deduplicating — remove unwanted rows and collapse duplicate records cleanly.
Practice quiz
When does the category dtype give the biggest benefit?
- When almost every value is unique
- When the column holds dates
- When a column has many rows but few distinct repeating values
- When there is only one row
Answer: When a column has many rows but few distinct repeating values. Category shines when few distinct values repeat across many rows, like status or country.
How does category save memory?
- It stores each label once and replaces cells with small integer codes
- It deletes rows
- It compresses the file on disk
- It rounds numbers
Answer: It stores each label once and replaces cells with small integer codes. Each distinct value is kept once in a lookup table; every cell becomes a tiny code pointing at it.
Which converts a column to the category dtype?
Answer: The astype('category') call. It converts the column to the category dtype.
What does the .cat accessor provide?
- String methods
- Category-specific tools like .cat.categories and .cat.codes
- Date parts
- Plotting helpers
Answer: Category-specific tools like .cat.categories and .cat.codes. The .cat accessor exposes categories, codes, and methods to manage the category set.
For pd.Series(['S','M','L','M','S']).astype('category'), what is .cat.categories?
Answer: ['L', 'M', 'S']. Inferred categories are stored in sorted order, not in order of first appearance.
For that same Series ['S','M','L','M','S'], what are the .cat.codes?
Answer: [2, 1, 0, 1, 2]. With categories ['L', 'M', 'S'] (codes L=0, M=1, S=2), each value maps to the position of its label.
What does an ordered category add over an unordered one?
- Faster grouping only
- A meaningful sequence so comparisons and sorting follow your order
- Automatic NaN filling
- Extra columns
Answer: A meaningful sequence so comparisons and sorting follow your order. An ordered category records a rank, so Low < Medium < High compares and sorts correctly.
With an ordered category Low<Medium<High applied to ['High','Low','Medium','High'], what is sort_values()?
Answer: Low, Medium, High, High. Sorting follows the declared order rather than the alphabet.
What happens if you assign a value that is not an existing category?
- It is added automatically
- It raises an error or becomes NaN
- It is ignored
- It converts the column to object
Answer: It raises an error or becomes NaN. Unknown labels have nowhere to go; add them first with .cat.add_categories().
How do you create an ordered categorical type?
- astype('ordered')
- pd.sort_categories()
Answer: CategoricalDtype(categories, ordered=True). It declares the ordered type, which you then apply with astype().