Dropping & Deduplicating Rows
Dropping is how you remove rows you don't want from a DataFrame — either by label with df.drop(index=...) or by collapsing repeated records with drop_duplicates().
Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
Learn to delete rows by index, remove them by condition, flag repeats with duplicated(), and deduplicate by a business key with subset= and keep=.
df.drop(index=[...]) removes rows by their index labels. It is the most direct option when you know exactly which rows to delete. To remove a column instead, use df.drop(columns=[...]): same method, different axis. To remove rows that meet a condition, though, the pandas idiom is to keep the rows you want with a boolean mask rather than naming the ones to drop.
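The three kinds of removal described above can be sketched as follows. This is a minimal illustration; the DataFrame, its name and age columns, and the index labels 10 to 13 are made up for the example and are not the lesson's data.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame(
    {"name": ["Ana", "Ben", "Cy", "Di"], "age": [17, 25, 31, 16]},
    index=[10, 11, 12, 13],
)

# Remove rows by index label (these are labels, not positions).
without_rows = df.drop(index=[10, 13])

# Remove a column with the same method, using columns= instead.
without_age = df.drop(columns=["age"])

# Remove rows by condition: keep the rows you want with a boolean mask.
adults = df[df["age"] >= 18]

print(without_rows.index.tolist())   # [11, 12]
print(without_age.columns.tolist())  # ['name']
print(adults["name"].tolist())       # ['Ben', 'Cy']
```

None of these calls changes df itself; each one returns a new DataFrame.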
Before deleting anything, it pays to look. df.duplicated() returns a boolean Series: True for every row that is a repeat of an earlier one, False for the first time a row is seen. Sum it to count duplicates, or use it as a mask to view exactly which rows are about to be removed.
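A short sketch of that inspection step, assuming a small invented table with email and plan columns (both names are hypothetical):

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one exactly.
df = pd.DataFrame(
    {
        "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
        "plan": ["free", "pro", "free", "free"],
    }
)

# By default duplicated() compares whole rows.
is_repeat = df.duplicated()

# Summing the boolean Series counts the repeats.
n_repeats = int(is_repeat.sum())

# Using it as a mask shows exactly which rows would be removed.
repeats = df[is_repeat]

print(is_repeat.tolist())      # [False, False, True, False]
print(n_repeats)               # 1
print(repeats.index.tolist())  # [2]
```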
drop_duplicates() removes the repeated rows and returns the cleaned table as a new DataFrame. Two arguments give you precise control: subset= chooses which columns define a duplicate (e.g. dedupe by email alone), and keep= picks which copy survives: "first" (the default), "last", or False to discard every row that has any duplicate.
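To compare the options side by side, here is a minimal sketch on an invented table in which two rows share an email but differ in plan (column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share an email but not a plan.
df = pd.DataFrame(
    {
        "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
        "plan": ["free", "pro", "pro", "free"],
    }
)

# Whole-row comparison: no two rows are fully identical, so nothing is dropped.
full = df.drop_duplicates()

# Judge duplicates by email alone; keep="first" is the default.
first = df.drop_duplicates(subset=["email"])

# Same key, but the last copy survives.
last = df.drop_duplicates(subset=["email"], keep="last")

# keep=False discards every row whose email appears more than once.
unique_only = df.drop_duplicates(subset=["email"], keep=False)

print(len(full))                   # 4
print(first.index.tolist())        # [0, 1, 3]
print(last.index.tolist())         # [1, 2, 3]
print(unique_only.index.tolist())  # [1, 3]
```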
❌ Passing a column name as if it were a row label
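A sketch of how this mistake typically shows up, assuming a made-up two-column DataFrame. With no keyword, drop() looks for the label among the rows, so a column name is not found and pandas raises a KeyError.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"name": ["Ana", "Ben"], "age": [17, 25]})

# Wrong: "age" is a column, but drop() searches the row labels by default.
try:
    df.drop("age")
    raised = False
except KeyError:
    raised = True

# Right: say explicitly that the label is a column.
fixed = df.drop(columns=["age"])

print(raised)                  # True
print(fixed.columns.tolist())  # ['name']
```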
❌ Duplicates "still there" after drop_duplicates
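One common cause, and the one the quiz below points to, is calling drop_duplicates() without keeping its result. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical data: the first two rows are identical.
df = pd.DataFrame({"email": ["a@x.com", "a@x.com", "b@x.com"]})

# The cleaned copy is returned and immediately thrown away; df is unchanged.
df.drop_duplicates()
rows_before = len(df)

# Assign the result back to keep the change.
df = df.drop_duplicates()
rows_after = len(df)

print(rows_before)  # 3
print(rows_after)   # 2
```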
A messy registration export needs three cleaning passes.
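The lesson text does not spell out the three passes here, so the following is only one plausible reading built from the techniques above: drop a known junk row by label, filter by a condition, then deduplicate by email. The data, the column names and the choice of passes are all assumptions.

```python
import pandas as pd

# Hypothetical registration export; every value here is invented.
raw = pd.DataFrame(
    {
        "email": ["test@x.com", "a@x.com", "b@x.com", "a@x.com", "c@x.com"],
        "age": [99, 22, 15, 22, 40],
    }
)

# Pass 1: drop a known test row by its index label.
clean = raw.drop(index=[0])

# Pass 2: remove under-18 rows by keeping the rows that pass the condition.
clean = clean[clean["age"] >= 18]

# Pass 3: deduplicate by email, keeping the first copy of each.
clean = clean.drop_duplicates(subset=["email"])

print(clean["email"].tolist())  # ['a@x.com', 'c@x.com']
```

Each pass reassigns clean, so the original export in raw stays available for comparison.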
Lesson complete — your tables are tidy!
You can drop rows by label or condition, drop columns with drop(columns=...), flag repeats with duplicated(), and collapse them with drop_duplicates(subset=, keep=).
🚀 Up next: Renaming & Relabelling — give your columns and index clean, consistent names.
Practice quiz
What does df.drop(index=[1]) remove?
- The column named 1
- The row whose index label is 1
- The first value of every column
- Nothing, it errors
Answer: The row whose index label is 1. drop(index=[1]) removes the row labelled 1.
How do you remove a whole column with drop?
- df.drop('col')
- df.drop(columns=['col'])
Answer: df.drop(columns=['col']). Use df.drop(columns=[...]) to drop a column by name.
What is the pandas idiom for removing rows that meet a condition?
- Keep the opposite rows with a boolean mask
- Loop and call drop on each
- Use df.delete(condition)
- Set the rows to None
Answer: Keep the opposite rows with a boolean mask. Conditional removal is done by keeping the rows you want, e.g. df[df['age'] >= 18].
What does df.duplicated() return?
- The deduplicated DataFrame
- A count of duplicates
- Only the unique rows
- A boolean Series flagging repeat rows
Answer: A boolean Series flagging repeat rows. duplicated() returns True for each row that repeats an earlier one.
For df with emails ['a','b','a','c'], what does df.duplicated().sum() return?
- 1
- 2
- 0
- 3
Answer: 1. Only the second 'a' is a repeat, so the sum is 1.
What does keep=False do in drop_duplicates()?
- Keeps only the first copy
- Keeps only the last copy
- Drops every row that has any duplicate
- Keeps all rows unchanged
Answer: Drops every row that has any duplicate. keep=False removes all rows that are part of any duplicate set, leaving only truly unique rows.
What does the subset= argument of drop_duplicates control?
- The output column order
- Which columns define a duplicate
- How many rows to keep
- The sort order
Answer: Which columns define a duplicate. subset=['email'] judges duplicates by that column alone, like a business key.
Why might drop_duplicates appear to 'not work'?
- It only works on numbers
- The result was not assigned back
- It needs an internet connection
- Duplicates cannot be removed
Answer: The result was not assigned back. drop_duplicates() returns a new DataFrame; reassign it (df = df.drop_duplicates()) or the change is lost.
What is the default value of keep in drop_duplicates()?
- 'last'
- False
- None
- 'first'
Answer: 'first'. The default keep='first' retains the earliest occurrence of each duplicate.
To reliably keep the newest record per id, you should first:
- Sort by the update column, then drop_duplicates with keep='last'
- Call drop_duplicates twice
- Use keep=False
- Reset the index
Answer: Sort by the update column, then drop_duplicates with keep='last'. keep='first' and keep='last' are judged by the current row order, not by any date, so sort by the 'updated' column first and then use keep='last'.
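The difference that sorting makes can be sketched as follows, assuming an invented table with id, status and updated columns (only the 'updated' name comes from the quiz; everything else is hypothetical):

```python
import pandas as pd

# Hypothetical records; file order does not match update order for id 1.
df = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "status": ["new", "new", "paid", "cancelled"],
        "updated": pd.to_datetime(
            ["2024-03-05", "2024-03-01", "2024-03-02", "2024-03-04"]
        ),
    }
)

# Without sorting, keep="last" keeps whichever row comes last in the file.
unsorted_last = df.drop_duplicates(subset=["id"], keep="last")

# Sorting by the update column first makes "last" mean "newest".
newest = (
    df.sort_values("updated")
    .drop_duplicates(subset=["id"], keep="last")
    .sort_index()
)

print(unsorted_last.index.tolist())  # [2, 3]
print(newest.index.tolist())         # [0, 3]
print(newest["status"].tolist())     # ['new', 'cancelled']
```

For id 1 the unsorted version keeps the row updated on 2024-03-02, while the sorted version keeps the one from 2024-03-05.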