One-Hot Encoding with get_dummies

One-hot encoding is the technique of replacing one text category column with several 0/1 indicator columns — one per category — so that machine learning models can read your categorical data as plain numbers without inventing a false order.


Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

You will use pd.get_dummies() with columns=, drop_first=, and dtype=int, understand why models need it, and reverse the process with pd.from_dummies().

Suppose a color column holds "red", "green", "blue". If you naively map them to 0, 1, 2 the model thinks blue (2) is twice green (1) — a meaning that does not exist. One-hot encoding instead creates a separate column per category that is 1 when the row matches and 0 otherwise, so no fake ordering sneaks in.
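To make the contrast concrete, here is a minimal sketch that encodes the same column both ways. The sample rows and the variable names (`naive`, `encoded`) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Naive integer mapping: the numbers suggest an order (red < green < blue)
# that the colors do not actually have
naive = df["color"].map({"red": 0, "green": 1, "blue": 2})
print(naive.tolist())

# One-hot encoding: one indicator column per category, no implied order.
# Encoding a single Series yields columns named after the categories,
# in sorted order: blue, green, red
encoded = pd.get_dummies(df["color"], dtype=int)
print(encoded)
```

Each row of `encoded` contains a single 1, in the column that matches that row's color.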

Use columns=[...] to encode only the columns you name and leave the rest alone. Use drop_first=True to drop the first indicator column of each encoded variable. That column is redundant because a row of all zeros already identifies the dropped category, and keeping it makes the indicators perfectly collinear with a regression intercept (the dummy variable trap).
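A short sketch of both arguments, assuming a small made-up table with two text columns and one numeric column:

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "size": ["S", "M", "S"],
    "price": [10, 12, 9],
})

# Encode only "color"; "size" and "price" pass through unchanged
only_color = pd.get_dummies(df, columns=["color"], dtype=int)
print(list(only_color.columns))
# ['size', 'price', 'color_blue', 'color_green', 'color_red']

# drop_first=True removes color_blue, the first category in sorted order
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)
print(list(reduced.columns))
# ['size', 'price', 'color_green', 'color_red']
```

In `reduced`, the blue row has 0 in both remaining color columns, and that combination is what marks it as blue.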

Once you are done processing, you often want the readable label back. pd.from_dummies() (available in pandas 1.5 and later) collapses the one-hot columns into a single category column again. Pass sep="_" so it knows where the column prefix ends and the category name begins, matching the underscore that get_dummies inserted.
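A minimal round-trip sketch, assuming pandas 1.5 or later and a made-up one-column table. Note that the frame handed to from_dummies should contain only indicator columns.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"color": ["red", "green", "blue"]})

# Encode: produces color_blue, color_green, color_red
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

# Decode: sep="_" splits "color_blue" into prefix "color" and category "blue"
decoded = pd.from_dummies(encoded, sep="_")
print(decoded)
```

The result is a DataFrame with a single `color` column holding the original labels in their original row order.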

pandas 2.0 and later returns True/False indicator columns by default, which can break later arithmetic. Pass dtype=int to get integer 0/1 columns instead.
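The sketch below compares the default output with the dtype=int output. The exact default dtype depends on your pandas version, so the comments hedge on it. The sample data is made up.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"color": ["red", "green", "blue"]})

# Default dtype: bool on pandas 2.0+, uint8 on older versions
default = pd.get_dummies(df, columns=["color"])
print(default.dtypes)

# Ask for integers up front
as_int = pd.get_dummies(df, columns=["color"], dtype=int)
print(as_int.dtypes)

# Or convert an already-encoded frame after the fact
converted = default.astype(int)
```

Both `as_int` and `converted` hold the same 0/1 values, so either route works. Passing dtype=int simply saves a step.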

❌ from_dummies fails on a drop_first encoding
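A sketch that reproduces the failure, assuming pandas 1.5 or later. The sample data and the `error_message` variable are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"color": ["red", "green", "blue"]})

# drop_first=True removes color_blue, so the blue row is all zeros
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

error_message = None
try:
    pd.from_dummies(reduced, sep="_")
except ValueError as err:
    # from_dummies refuses to guess a category for the all-zeros row
    error_message = str(err)
    print("ValueError:", error_message)
```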

✅ Fix: decode the full encoding, or pass default_category= to name the dropped level.
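Both options sketched side by side, assuming pandas 1.5 or later. The sample data is made up, and "blue" is named as the default because it is the level that drop_first removed here.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"color": ["red", "green", "blue"]})

# Option 1: keep every indicator column, then decode
full = pd.get_dummies(df, columns=["color"], dtype=int)
decoded_full = pd.from_dummies(full, sep="_")

# Option 2: decode a drop_first encoding by naming the dropped level
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)
decoded_default = pd.from_dummies(reduced, sep="_", default_category="blue")

print(decoded_full["color"].tolist())
print(decoded_default["color"].tolist())
```

Option 2 only gives the right answer if you know which level was dropped, so it is worth recording that level when you encode.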

Turn a tiny housing table into a numeric feature matrix.
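One possible approach, as a sketch only: the table below and its column names (`city`, `bedrooms`, `price`) are invented for illustration and are not the exercise's actual data.

```python
import pandas as pd

# Hypothetical housing table; the real exercise data may differ
houses = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "bedrooms": [3, 2, 4, 3],
    "price": [350000, 520000, 410000, 450000],
})

# Encode the one text column, drop the redundant level, force integers
features = pd.get_dummies(houses, columns=["city"], drop_first=True, dtype=int)

print(list(features.columns))
# ['bedrooms', 'price', 'city_Boston', 'city_Denver']
print(features.dtypes)
```

Every column in `features` is now numeric. Whether to keep drop_first=True depends on the model you plan to train.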

Lesson complete — your categories are model-ready!

You can expand text columns into 0/1 indicators with get_dummies, target specific columns, drop the redundant level, force integers with dtype=int, and reverse it all with from_dummies.

🚀 Up next: Rolling & Expanding Windows — smooth and accumulate values over a sliding window.

Practice quiz

What does one-hot encoding do to a text category column?

Replaces it with a separate 0/1 column per category
Deletes it
Maps it to 0, 1, 2 in order
Sorts it alphabetically

Answer: Replaces it with a separate 0/1 column per category. Each category gets its own indicator column that is 1 when the row matches, else 0.

Why is mapping categories to 0, 1, 2 a problem for many models?

It uses too much memory
It implies a false ordering between categories
It is slower to compute
It cannot be reversed

Answer: It implies a false ordering between categories. Models treat numbers as ordered, so 0/1/2 wrongly implies one category is 'greater'.

In each row of a one-hot encoding, how many indicator columns are 'hot' (1)?

All of them
None
Exactly one
Two

Answer: Exactly one. Exactly one column is 1 per row, which is why it is called one-hot.

What does the columns=[...] argument of get_dummies do?

Renames the output columns
Sets the column order
Drops those columns
Encodes only the columns you name

Answer: Encodes only the columns you name. columns=[...] limits encoding to the listed columns and leaves the rest alone.

What does drop_first=True do?

Drops the first redundant indicator column
Removes the first row
Sorts categories
Keeps only the first category

Answer: Drops the first redundant indicator column. It removes one redundant column to avoid the dummy variable trap in regression.

Why pass dtype=int to get_dummies in modern pandas?

To sort the output
Modern pandas returns True/False by default; dtype=int gives clean 0 and 1
To reduce the number of columns
It is required or it errors

Answer: Modern pandas returns True/False by default; dtype=int gives clean 0 and 1. Boolean columns can misbehave in later arithmetic, so dtype=int asks for integer 0/1 columns up front.

Which function reverses one-hot encoding back to a single column?

pd.melt()
pd.undummy()
pd.from_dummies()
pd.decode()

Answer: pd.from_dummies(). pd.from_dummies() reconstructs the original categorical column.

Which kind of model usually does NOT need drop_first=True?

Linear regression
Logistic regression
Models that assume no multicollinearity
Tree-based models

Answer: Tree-based models. Tree models do not suffer the dummy trap, so you usually keep every column there.

For get_dummies on color values red, green, blue with dtype=int, what is the sum across the new columns in any single row?

Answer: 1. Exactly one indicator is 1 per row, so the row sum is 1.

Why can from_dummies fail on a drop_first encoding?

It needs more than one row
An all-zeros row has no category to recover
It only works with strings
It needs a numeric index

Answer: An all-zeros row has no category to recover. With the first column dropped, an all-zeros row is ambiguous unless you give default_category.