Categorical Data

A categorical column uses the pandas category dtype, which stores each distinct text value once and replaces every cell with a small integer code, saving memory and speeding up grouping.

Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

Learn to convert columns with astype("category"), measure the memory savings, build ordered categories that sort and compare correctly, and use the .cat accessor.

A column like status that only ever holds "active", "paused", or "closed" wastes space as an object column, because it stores the full word in every single row. Convert it with df["status"].astype("category") and pandas keeps just one copy of each label, replacing every cell with a small integer code that points back at the label.
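To see the saving for yourself, here is a minimal sketch that measures the same column before and after conversion. The DataFrame and its 30,000 rows are made up for illustration; exact byte counts depend on your pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical data: 30,000 rows drawn from only three distinct labels,
# stored explicitly as an object column.
labels = ["active", "paused", "closed"] * 10_000
df = pd.DataFrame({"status": pd.Series(labels, dtype="object")})

# deep=True counts the bytes of the strings themselves,
# not just the pointers to them.
object_bytes = df["status"].memory_usage(deep=True)

df["status"] = df["status"].astype("category")
category_bytes = df["status"].memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The fewer distinct labels a column has relative to its row count, the larger the gap between the two numbers.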

Categorical columns get a dedicated namespace, .cat, just like text columns get .str and dates get .dt. Through it you read and reshape the category set: .cat.categories lists the labels, .cat.codes shows the integers, and methods such as .cat.add_categories() and .cat.remove_unused_categories() let you tidy the label list independently of the data.
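A short sketch of the accessor in action, using a made-up status Series. The variable names (widened, open_only, tidy) are illustrative only.

```python
import pandas as pd

status = pd.Series(
    ["active", "paused", "active", "closed", "paused"]
).astype("category")

print(list(status.cat.categories))  # ['active', 'closed', 'paused']
print(status.cat.codes.tolist())    # [0, 2, 0, 1, 2]

# add_categories returns a new Series: the data is unchanged,
# only the set of allowed labels grows.
widened = status.cat.add_categories(["archived"])

# Filtering rows keeps every label in the category set,
# even labels that no remaining row uses.
open_only = status[status != "closed"]

# remove_unused_categories drops the labels left without rows.
tidy = open_only.cat.remove_unused_categories()
print(list(tidy.cat.categories))    # ['active', 'paused']
```

Note that these .cat methods return a new Series rather than changing the original, so assign the result back if you want to keep it.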

Some categories have a natural ranking that the alphabet gets wrong: sorting ["Low", "Medium", "High"] as plain text gives High, Low, Medium, which is nonsense. An ordered category teaches pandas the real sequence. Then comparisons like df["size"] > "Low" mean what you expect, and sort_values follows your order rather than the alphabet.
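One way to declare that sequence is with pd.CategoricalDtype, sketched below on a small made-up size column. The names size_type, above_low and ranked are illustrative.

```python
import pandas as pd

# The order of the list is the rank, lowest first.
size_type = pd.CategoricalDtype(
    categories=["Low", "Medium", "High"], ordered=True
)

df = pd.DataFrame({"size": ["High", "Low", "Medium", "High"]})
df["size"] = df["size"].astype(size_type)

# Comparisons now use the declared rank, not the alphabet.
above_low = df[df["size"] > "Low"]
print(above_low["size"].tolist())  # ['High', 'Medium', 'High']

# Sorting follows the declared rank as well.
ranked = df.sort_values("size")
print(ranked["size"].tolist())     # ['Low', 'Medium', 'High', 'High']
```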

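Setting a brand-new label on a category column produces an error or NaN: assigning a label that is not in the category set raises an error, while casting data to a categorical dtype that lacks the label turns the value into NaN.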
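The sketch below shows both outcomes and the usual fix, on made-up data. It catches both TypeError and ValueError because the exact exception class is an assumption that may vary between pandas versions.

```python
import pandas as pd

status = pd.Series(["active", "paused", "closed"]).astype("category")

# Direct assignment of an unknown label is rejected.
try:
    status.iloc[0] = "archived"
    assignment_failed = False
except (TypeError, ValueError) as exc:
    assignment_failed = True
    print("rejected:", exc)

# Casting to a dtype that lacks a label quietly turns it into NaN.
narrow = pd.CategoricalDtype(["active", "paused"])
cast = pd.Series(["active", "closed"]).astype(narrow)
print(cast.isna().tolist())  # [False, True]

# The fix: register the label first, then assign it.
status = status.cat.add_categories(["archived"])
status.iloc[0] = "archived"
print(status.tolist())       # ['archived', 'paused', 'closed']
```

The NaN case is the more dangerous of the two because nothing fails; checking isna() after a cast is a cheap safeguard.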

❌ Sorting an unordered category and expecting rank order
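A minimal sketch of this mistake and one possible fix, on made-up data. A plain astype("category") infers the labels in alphabetical order and leaves the category unordered, so sorting is alphabetical and rank comparisons are refused.

```python
import pandas as pd

levels = pd.Series(["Low", "Medium", "High"]).astype("category")

print(levels.cat.ordered)             # False
print(levels.sort_values().tolist())  # ['High', 'Low', 'Medium']

# Rank comparisons on an unordered category raise TypeError.
try:
    levels > "Low"
    comparison_failed = False
except TypeError as exc:
    comparison_failed = True
    print("rejected:", exc)

# Declare the real sequence, lowest first, and mark it as ordered.
ranked = levels.cat.reorder_categories(
    ["Low", "Medium", "High"], ordered=True
)
print(ranked.sort_values().tolist())  # ['Low', 'Medium', 'High']
```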

Turn free-text ratings into an ordered category and sort respondents.
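If you want a starting point, here is one possible approach, not the official solution. The survey data, the column names and the three rating labels are all hypothetical; the cleanup step assumes the free text differs only in spacing and capitalisation.

```python
import pandas as pd

# Hypothetical survey data with inconsistent spacing and case.
survey = pd.DataFrame({
    "respondent": ["Ana", "Ben", "Cy", "Dee"],
    "rating": [" good", "Poor ", "EXCELLENT", "good"],
})

# Assumed rank, lowest first.
rating_type = pd.CategoricalDtype(
    ["Poor", "Good", "Excellent"], ordered=True
)

# Normalise the text first; any label that still fails to match
# the category list would become NaN in the cast.
survey["rating"] = (
    survey["rating"].str.strip().str.title().astype(rating_type)
)

# Best ratings first.
ranked = survey.sort_values("rating", ascending=False)
print(ranked["rating"].tolist())  # ['Excellent', 'Good', 'Good', 'Poor']
```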

Lesson complete — your repeated text is now cheap!

You can convert with astype("category"), measure the memory win, drive the .cat accessor, and build ordered categories that compare and sort by real rank.

🚀 Up next: Dropping & Deduplicating — remove unwanted rows and collapse duplicate records cleanly.

Practice quiz

When does the category dtype give the biggest benefit?

  • When almost every value is unique
  • When the column holds dates
  • When a column has many rows but few distinct repeating values
  • When there is only one row

Answer: When a column has many rows but few distinct repeating values. Category shines when few distinct values repeat across many rows, like status or country.

How does category save memory?

  • It stores each label once and replaces cells with small integer codes
  • It deletes rows
  • It compresses the file on disk
  • It rounds numbers

Answer: It stores each label once and replaces cells with small integer codes. Each distinct value is kept once in a lookup table; every cell becomes a tiny code pointing at it.

Which converts a column to the category dtype?

Answer: astype('category'). Calling astype('category') on a column converts it to the category dtype.

What does the .cat accessor provide?

  • String methods
  • Category-specific tools like .cat.categories and .cat.codes
  • Date parts
  • Plotting helpers

Answer: Category-specific tools like .cat.categories and .cat.codes. The .cat accessor exposes categories, codes, and methods to manage the category set.

For pd.Series(['S','M','L','M','S']).astype('category'), what is .cat.categories?

Answer: ['L', 'M', 'S']. Categories are stored sorted, not in order of first appearance.

For that same Series ['S','M','L','M','S'], what are the .cat.codes?

Answer: [2, 1, 0, 1, 2]. With categories ['L','M','S'] (codes L=0, M=1, S=2), each value is replaced by the code of its label.
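You can check both answers yourself with a few lines. The sketch below also rebuilds the original values from the codes, which is how the lookup works conceptually; the variable names are illustrative.

```python
import pandas as pd

sizes = pd.Series(["S", "M", "L", "M", "S"]).astype("category")

categories = list(sizes.cat.categories)
codes = sizes.cat.codes.tolist()

print(categories)  # ['L', 'M', 'S']
print(codes)       # [2, 1, 0, 1, 2]

# Indexing the category list by each code rebuilds the original values.
rebuilt = [categories[code] for code in codes]
print(rebuilt)     # ['S', 'M', 'L', 'M', 'S']
```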

What does an ordered category add over an unordered one?

  • Faster grouping only
  • A meaningful sequence so comparisons and sorting follow your order
  • Automatic NaN filling
  • Extra columns

Answer: A meaningful sequence so comparisons and sorting follow your order. An ordered category records a rank, so Low < Medium < High compares and sorts correctly.

With an ordered category Low<Medium<High applied to ['High','Low','Medium','High'], what is sort_values()?

Answer: ['Low', 'Medium', 'High', 'High']. Sorting follows the declared order, not the alphabet.

What happens if you assign a value that is not an existing category?

  • It is added automatically
  • It raises an error or becomes NaN
  • It is ignored
  • It converts the column to object

Answer: It raises an error or becomes NaN. Unknown labels have nowhere to go; add them first with .cat.add_categories().

How do you create an ordered categorical type?

  • astype('ordered')
  • pd.sort_categories()
  • pd.CategoricalDtype(categories, ordered=True)

Answer: pd.CategoricalDtype(categories, ordered=True). It declares the ordered type, which you then apply with astype().