One-Hot Encoding with get_dummies
One-hot encoding is the technique of replacing one text category column with several 0/1 indicator columns — one per category — so that machine learning models can read your categorical data as plain numbers without inventing a false order.
Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
You will use pd.get_dummies() with columns=, drop_first=, and dtype=int, learn why models need one-hot encoding, and reverse the process with pd.from_dummies().
Suppose a color column holds "red", "green", "blue". If you naively map them to 0, 1, 2, a model that treats the values as numbers reads blue (2) as larger than green (1), even as twice green, which is a meaning that does not exist. One-hot encoding instead creates a separate column per category that is 1 when the row matches and 0 otherwise, so no fake ordering sneaks in.
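The contrast above can be sketched in a few lines. This is a minimal illustration, not the lesson's original example; the DataFrame contents and the variable names `naive` and `onehot` are made up for the demonstration.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Naive label mapping: the numbers imply red < green < blue
naive = df["color"].map({"red": 0, "green": 1, "blue": 2})

# One-hot encoding: one 0/1 indicator column per category
# (get_dummies orders the new columns by sorted category: blue, green, red)
onehot = pd.get_dummies(df["color"], dtype=int)

print(naive.tolist())
print(onehot)
```

Every row of `onehot` has a single 1, in the column that matches that row's color.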
Use columns=[...] to encode only the columns you name and leave the rest alone. Use drop_first=True to drop the first indicator column (the first category in sorted order). It is redundant because "all zeros" already identifies that category. Keeping it also makes the indicator columns sum to 1 in every row, which is perfectly collinear with a regression intercept (the dummy variable trap).
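A short sketch of both arguments, assuming a small made-up table (the column names `color`, `size`, and `price` are hypothetical). Only `color` is encoded; `size` and `price` pass through unchanged.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "size": ["S", "M", "L"],
    "price": [10, 12, 15],
})

# Encode only "color"; new columns are prefixed with "color_"
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(list(encoded.columns))
# ['size', 'price', 'color_blue', 'color_green', 'color_red']

# drop_first=True removes the first level in sorted order (color_blue)
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)
print(list(reduced.columns))
# ['size', 'price', 'color_green', 'color_red']
```

In `reduced`, the blue row is the one where both `color_green` and `color_red` are 0.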
Once you are done processing, you often want the readable label back. pd.from_dummies() collapses the one-hot columns into a single category column again. It needs a sep= so it knows where the prefix ends, matching the underscore that get_dummies inserted.
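The round trip might look like the sketch below. It assumes pandas 1.5 or later (the release that added pd.from_dummies) and uses a made-up single-column table.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"color": ["red", "green", "blue"]})

# Full encoding: color_blue, color_green, color_red
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

# sep="_" tells from_dummies to split "color_red" into prefix "color" and value "red"
decoded = pd.from_dummies(encoded, sep="_")

print(decoded["color"].tolist())
```

The decoded frame has a single `color` column holding the original labels in their original row order.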
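pandas 2.0 and later returns True/False indicator columns by default, which can trip up later steps that expect numeric 0/1 values. Pass dtype=int to get integer columns instead.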
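A minimal sketch of the difference, using a made-up Series. The default dtype you see depends on the installed pandas version, so the comments describe pandas 2.0 or later.

```python
import pandas as pd

# Hypothetical sample data for illustration
s = pd.Series(["red", "green", "blue"], name="color")

# Default output: True/False columns on pandas 2.0+
as_default = pd.get_dummies(s)

# Explicit integers: 0/1 columns regardless of pandas version
as_int = pd.get_dummies(s, dtype=int)

print(as_default.dtypes)
print(as_int.dtypes)
```

Passing dtype=int up front avoids a separate .astype(int) call later in the pipeline.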
❌ from_dummies fails on a drop_first encoding
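The failure can be reproduced with a sketch like this one, assuming pandas 1.5 or later and a made-up table. With drop_first=True the blue row becomes all zeros, and from_dummies raises a ValueError because it cannot assign that row a category.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"color": ["red", "green", "blue"]})

# drop_first=True removes color_blue, so the blue row is all zeros
dropped = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

try:
    pd.from_dummies(dropped, sep="_")
    failed = False
except ValueError as err:
    failed = True
    print("from_dummies raised:", err)
```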
✅ Fix: decode the full encoding, or pass default_category= to name the category that was dropped.
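A sketch of the default_category route, again assuming pandas 1.5 or later and made-up data. You have to know which category was dropped (here "blue", the first in sorted order), because the encoded frame no longer records it.

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({"color": ["red", "green", "blue"]})
dropped = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

# All-zero rows are decoded as the default category for the "color" prefix
decoded = pd.from_dummies(
    dropped,
    sep="_",
    default_category={"color": "blue"},
)

print(decoded["color"].tolist())
```

The other option is to keep the full encoding (no drop_first) for anything you plan to decode and to drop the redundant column only when building the model input.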
Turn a tiny housing table into a numeric feature matrix.
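One possible shape for this exercise is sketched below. The table, its column names (`city`, `bedrooms`, `price`), and its values are all invented for illustration and are not the lesson's own data.

```python
import pandas as pd

# Hypothetical housing table for illustration
houses = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "bedrooms": [3, 2, 4, 3],
    "price": [350, 500, 420, 390],
})

# Encode the only text column; numeric columns pass through unchanged
features = pd.get_dummies(houses, columns=["city"], dtype=int)

print(features.dtypes)
print(features)
```

After this step every column in `features` is numeric, which is what most model-fitting APIs expect.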
Lesson complete — your categories are model-ready!
You can expand text columns into 0/1 indicators with get_dummies, target specific columns, drop the redundant level, force integers with dtype=int, and reverse it all with from_dummies.
🚀 Up next: Rolling & Expanding Windows — smooth and accumulate values over a sliding window.
Practice quiz
What does one-hot encoding do to a text category column?
- Replaces it with a separate 0/1 column per category
- Deletes it
- Maps it to 0, 1, 2 in order
- Sorts it alphabetically
Answer: Replaces it with a separate 0/1 column per category. Each category gets its own indicator column that is 1 when the row matches, else 0.
Why is mapping categories to 0, 1, 2 a problem for many models?
- It uses too much memory
- It implies a false ordering between categories
- It is slower to compute
- It cannot be reversed
Answer: It implies a false ordering between categories. Models treat numbers as ordered, so 0/1/2 wrongly implies one category is 'greater'.
In each row of a one-hot encoding, how many indicator columns are 'hot' (1)?
- All of them
- None
- Exactly one
- Two
Answer: Exactly one. Exactly one column is 1 per row, which is why it is called one-hot.
What does the columns=[...] argument of get_dummies do?
- Renames the output columns
- Sets the column order
- Drops those columns
- Encodes only the columns you name
Answer: Encodes only the columns you name. columns=[...] limits encoding to the listed columns and leaves the rest alone.
What does drop_first=True do?
- Drops the first redundant indicator column
- Removes the first row
- Sorts categories
- Keeps only the first category
Answer: Drops the first redundant indicator column. It removes one redundant column to avoid the dummy variable trap in regression.
Why pass dtype=int to get_dummies in modern pandas?
- To sort the output
- Modern pandas returns True/False by default; dtype=int gives clean 0 and 1
- To reduce the number of columns
- It is required or it errors
Answer: Modern pandas returns True/False by default; dtype=int gives clean 0 and 1. By default pandas returns booleans; dtype=int produces integer 0/1 columns.
Which function reverses one-hot encoding back to a single column?
- pd.melt()
- pd.undummy()
- pd.from_dummies()
- pd.decode()
Answer: pd.from_dummies(). pd.from_dummies() reconstructs the original categorical column.
Which kind of model usually does NOT need drop_first=True?
- Linear regression
- Logistic regression
- Models that assume no multicollinearity
- Tree-based models
Answer: Tree-based models. Tree models do not suffer the dummy trap, so you usually keep every column there.
For get_dummies on color values red, green, blue with dtype=int, what is the sum across the new columns in any single row?
- 1
- 0
- 2
- 3
Answer: 1. Exactly one indicator is 1 per row, so the row sum is 1.
Why can from_dummies fail on a drop_first encoding?
- It needs more than one row
- An all-zeros row has no category to recover
- It only works with strings
- It needs a numeric index
Answer: An all-zeros row has no category to recover. With the first column dropped, an all-zeros row is ambiguous unless you give default_category.