Capstone: End-to-End Data Cleaning
A data cleaning pipeline is the ordered sequence of pandas steps that turns a messy raw file into an analysis-ready table — reading the data, inspecting it, fixing strings and dates, handling missing values, fixing dtypes, deriving new columns, filtering, summarising, and exporting the result. This capstone combines every skill from the course into one realistic workflow.
Learn Capstone: End-to-End Data Cleaning in our free Pandas course — a beginner-friendly interactive lesson with worked examples, a practice exercise and a…
Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
Always look before you clean. head() shows the shape of the data, info() reveals dtypes and null counts, and isna().sum() tells you exactly where the gaps are. Notice the messy strings, mixed casing, and missing values below.
Strip whitespace and normalise casing with the .str accessor. Parse dates with pd.to_datetime(..., errors="coerce") so unparseable values become NaT instead of crashing.
Here is the whole workflow end to end: load → clean → handle missing → fix dtypes → derive a column → filter → summarise → export. This is the kind of script you would run on a real messy export.
Replace the missing prices with 0 so the column has no gaps. Fill in the blank method.
Apply the whole pipeline to a small messy survey: clean the names, normalise the rating column, drop rows with no rating, then report the average rating per city.
Capstone complete — you finished the Pandas course!
You took raw, messy data and ran it through a full professional pipeline: load, inspect, clean strings and dates, handle missing values, fix dtypes, derive columns, filter, summarise, and export. Combined with everything earlier — Series, DataFrames, selection, filtering, aggregation, merging, reshaping, MultiIndex, and performance — you now have the complete toolkit of a working data analyst.
🎓 Congratulations! Take these skills to a real dataset and keep practising.
Practice quiz
What does io.StringIO let you do in these cleaning examples?
- Read CSV text from an in-memory string as if it were a file
- Download files from the web
- Compress data
- Connect to a database
Answer: Read CSV text from an in-memory string as if it were a file. io.StringIO wraps a string so pd.read_csv can read CSV text defined right in the script.
Which trio of calls is the standard first step to inspect raw data?
- head(), info(), isna().sum()
- drop(), rename(), sort()
- merge(), concat(), pivot()
- plot(), hist(), bar()
Answer: head(), info(), isna().sum(). head shows the shape, info shows dtypes and null counts, and isna().sum() locates the gaps.
What does errors='coerce' do in pd.to_datetime or pd.to_numeric?
- Raises on the first bad value
- Turns unparseable values into NaT/NaN instead of raising
- Deletes the bad rows
- Ignores the column
Answer: Turns unparseable values into NaT/NaN instead of raising. coerce converts values it cannot parse into NaT or NaN so you can handle them deliberately.
To strip whitespace and Title-Case a text column, which accessor do you use?
- .dt
- .cat
- .str
- .loc
Answer: .str. The .str accessor exposes string methods like .str.strip().str.title().
Which method replaces missing values with a chosen fill, like 0 or a median?
- dropna()
- fillna()
- drop_duplicates()
- reset_index()
Answer: fillna(). fillna substitutes a value for NaN; dropna removes rows/columns instead.
When is dropna a better choice than fillna for missing values?
- When a sensible default exists
- When the row is unusable without the value, like a missing key
- Always
- Never
Answer: When the row is unusable without the value, like a missing key. Drop rows that cannot be used without the value; fill when a reasonable default exists.
Which method removes exact duplicate rows from a DataFrame?
- drop_duplicates()
- unique()
- distinct()
- dedupe()
Answer: drop_duplicates(). drop_duplicates() collapses repeated identical rows.
Why convert a low-cardinality column with astype('category') during cleaning?
- To make it editable
- To save memory and speed up grouping
- To parse dates
- To drop nulls
Answer: To save memory and speed up grouping. Category stores repeated labels once with small integer codes, saving memory and speeding grouping.
Which method exports a cleaned DataFrame to CSV text or a file?
- to_csv()
- read_csv()
- to_datetime()
- save()
Answer: to_csv(). df.to_csv writes the DataFrame out as CSV.
What is the correct overall order of a cleaning pipeline?
- Export, then load, then inspect
- Load, inspect, clean, fix dtypes, derive, filter, summarise, export
- Summarise first, then load
- Filter before loading
Answer: Load, inspect, clean, fix dtypes, derive, filter, summarise, export. You load and inspect first, then clean and transform, and export the result last.