Checkpoint: Cleaning & Wrangling

A checkpoint is a hands-on review where you combine every cleaning skill — dtypes, dropping, renaming, counting, querying, and replacing — to wrangle one messy DataFrame into shape.

Learn Checkpoint: Cleaning & Wrangling in our free Pandas course — a beginner-friendly interactive lesson with worked examples, a practice exercise and a…

Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

You'll recap the whole cleaning toolkit, tackle a multi-step build challenge with a full solution, and test yourself with a checkpoint quiz before moving on.

Here is a small but realistically grubby export from a sign-up form. The numbers arrived as text, there is a duplicate row, the headers are ugly, a status code needs expanding, and there is an under-age entry. Your job is to turn it into clean, analysis-ready data using the skills from this unit.

Work through these five steps. Each one uses a skill from this unit.

A clean pipeline is repeatable. Run the finished version and confirm the dtypes are right, the duplicate is gone, and only adults remain. Checking df.dtypes and len(df) at the end is a habit worth keeping.

Filtering a text column with a number compares strings, not values:

Each step returns a new frame; without reassignment they vanish:

Answer each in your head, then expand to check.

df.set_index("col") . Use reset_index() to push it back to an ordinary column.

At least one value is non-numeric text, so pandas stored the column as strings. Clean it and convert with astype , or use pd.to_numeric(col, errors="coerce") to turn bad values into NaN.

Sort by the timestamp, then df.drop_duplicates(subset=["email"], keep="last") . The subset judges duplicates by email; keep="last" retains the most recent.

Proportions that sum to 1 instead of raw counts — each category's share of the column. Multiply by 100 for percentages.

Prefix it with @: df.query("age > @cutoff") . Without the @, pandas looks for a column called cutoff.

where keeps the original value where cond is True and inserts 0 where it is False. mask is the opposite: it inserts 0 where cond is True. Mask hides (replaces) the True cells.

Checkpoint cleared — you can wrangle messy data!

You combined dtype fixes, deduplication, renaming, code expansion, and filtering into one repeatable pipeline — the everyday work of real data cleaning.

🚀 Up next: Binning Continuous Data — turn continuous numbers into labelled buckets with cut and qcut.

Practice quiz

Which method promotes an ordinary column to become the row index?

df.reset_index("col")
df.set_index("col")
df.reindex("col")
df.index("col")

Answer: df.set_index("col"). df.set_index("col") moves a column into the index. reset_index() pushes it back to a normal column.

A column of numbers shows dtype "object". What is the usual cause?

The values are stored as strings
The column has too many rows
The index is not sorted
The column name has spaces

Answer: The values are stored as strings. object dtype means at least one value is text, so pandas stored the whole column as strings. Convert with astype or to_numeric.

Your numeric filter df.query("age >= 18") returned no rows on clean-looking data. Why?

age is still text (object dtype)
query() does not support >=
18 must be quoted
You forgot to sort first

Answer: age is still text (object dtype). If age is still object dtype, the string "30" is not greater than the number 18, so the comparison silently matches nothing.

Which call safely converts bad values to NaN instead of raising during conversion?

astype("int")
pd.to_numeric(col, errors="coerce")
col.replace(int)
col.fillna("int")

Answer: pd.to_numeric(col, errors="coerce"). pd.to_numeric(col, errors="coerce") turns non-numeric values into NaN rather than throwing, unlike astype.

To keep one row per email, retaining the newest, what do you do after sorting by timestamp?

Answer: email. With the data sorted by time, drop_duplicates(subset=["email"], keep="last") keeps the most recent record per email.

What does value_counts(normalize=True) return?

Raw integer counts
Proportions that sum to 1
The unique values only
A sorted index

Answer: Proportions that sum to 1. normalize=True returns each category share as a proportion summing to 1; multiply by 100 for percentages.

Inside a query string, how do you reference a Python variable named cutoff?

df.query("age > cutoff")
df.query("age > @cutoff")
df.query("age > $cutoff")
df.query("age > {cutoff}")

Answer: df.query("age > @cutoff"). Prefix the variable with @: df.query("age > @cutoff"). Without @, pandas looks for a column named cutoff.

What is the difference between df.where(cond, 0) and df.mask(cond, 0)?

They are identical
where keeps values where cond is True; mask replaces values where cond is True
where only works on numbers
mask drops rows

Answer: where keeps values where cond is True; mask replaces values where cond is True. where keeps the original where cond is True (inserting 0 where False); mask is the opposite, inserting 0 where cond is True.

Which method expands code values, e.g. {"P": "pro", "F": "free"}, across a column?

plan

Answer: plan. Series.replace({...}) maps each old value to its replacement, expanding short codes into readable labels.

Why must cleaning steps like df.drop_duplicates() be reassigned or chained?

They run too slowly otherwise
They return a new frame; without reassignment the change is lost
They modify the index only
They require strict mode

Answer: They return a new frame; without reassignment the change is lost. Most cleaning methods return a new DataFrame and do not mutate in place, so you must reassign (df = ...) or chain them.