Dropping & Deduplicating Rows

Dropping is how you remove rows you don't want from a DataFrame — either by label with df.drop(index=...) or by collapsing repeated records with drop_duplicates().


Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

Learn to delete rows by index, remove them by condition, flag repeats with duplicated(), and deduplicate by a business key with subset= and keep=.

df.drop(index=[...]) removes rows by their index labels. It is the direct choice when you know exactly which rows to delete. To remove a column instead, use df.drop(columns=[...]): same method, different axis. For removing rows that meet a condition, though, the pandas idiom is to keep the rows you want with a boolean mask rather than naming the ones to drop.
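A minimal sketch of the three moves described above. The table, its column names, and its index labels are made up for illustration and are not part of the lesson data.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame(
    {"name": ["Ana", "Ben", "Cy", "Dee"], "age": [34, 15, 52, 17]},
    index=[10, 11, 12, 13],
)

# Drop rows by index label (labels, not positions).
without_rows = df.drop(index=[11, 13])

# Drop a column with the same method, different axis.
without_age = df.drop(columns=["age"])

# Remove rows by condition: keep the rows you want with a boolean mask.
adults = df[df["age"] >= 18]
```

None of these calls modifies df itself; each returns a new DataFrame.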

Before deleting anything, it pays to look. df.duplicated() returns a boolean Series: True for every row that is a repeat of an earlier one, False for the first time a row is seen. Sum it to count duplicates, or use it as a mask to view exactly which rows are about to be removed.
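As a rough illustration of looking before deleting, the sketch below flags, counts, and previews the repeats. The email and plan columns are assumed sample data.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
    "plan": ["free", "pro", "free", "pro"],
})

flags = df.duplicated()     # boolean Series: True where a row repeats an earlier one
n_dupes = int(flags.sum())  # True counts as 1, so the sum is the number of repeats
repeats = df[flags]         # the rows that drop_duplicates() would remove
```

Here only the third row matches an earlier row on every column, so it is the only one flagged.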

drop_duplicates() returns a new DataFrame with the repeated rows removed. Two arguments give you precise control: subset= chooses which columns define a duplicate (e.g. dedupe by email alone), and keep= picks which copy survives: "first" (default), "last", or False to discard every row that has any duplicate.
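The sketch below compares the keep= options on one small table. The data is invented for illustration; note that the two a@x.com rows differ in plan, so they only count as duplicates once subset= narrows the comparison to email.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
    "plan": ["free", "pro", "pro", "pro"],
})

# No subset: rows must match on every column, so nothing is removed here.
full_row = df.drop_duplicates()

# Judge duplicates by email alone.
first_kept = df.drop_duplicates(subset=["email"])                # keep="first" is the default
last_kept = df.drop_duplicates(subset=["email"], keep="last")    # keep the later a@x.com row
only_unique = df.drop_duplicates(subset=["email"], keep=False)   # drop both a@x.com rows
```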

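Passing a column name as if it were a row label: df.drop('col') searches the row labels by default and raises a KeyError when no row has that label. Name the axis explicitly with df.drop(columns=['col']).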
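A small sketch of this mistake and its fix, assuming a made-up table with an age column. The try/except is there only so the example runs to the end.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"name": ["Ana", "Ben"], "age": [34, 15]})

try:
    # With no keyword, drop() looks for a ROW labelled "age" and finds none.
    df.drop("age")
    error_raised = False
except KeyError:
    error_raised = True

# Naming the axis removes the column as intended.
fixed = df.drop(columns=["age"])
```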

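❌ Duplicates "still there" after drop_duplicates: the method returns a new DataFrame and leaves the original untouched, so assign the result back with df = df.drop_duplicates().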
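The usual cause is calling the method without keeping its result. A minimal sketch, using an assumed one-column table:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"email": ["a@x.com", "a@x.com", "b@x.com"]})

df.drop_duplicates()        # result is thrown away; df is unchanged
rows_before = len(df)

df = df.drop_duplicates()   # assign the cleaned table back
rows_after = len(df)
```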

A messy registration export needs three cleaning passes.
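The lesson's own export and its exact passes are not reproduced here. As one possible sketch, assume the three passes are: drop a known junk row by label, remove rows without an email by condition, then deduplicate by email. Every column name and value below is hypothetical.

```python
import pandas as pd

# Hypothetical registration export; all columns and values are assumed.
raw = pd.DataFrame({
    "email": ["test@example.com", "ana@x.com", "ben@x.com", "ana@x.com", None],
    "name": ["TEST", "Ana", "Ben", "Ana M.", "Cy"],
})

# Pass 1: drop a known junk row by its index label.
clean = raw.drop(index=[0])

# Pass 2: remove rows by condition, keeping only rows that have an email.
clean = clean[clean["email"].notna()]

# Pass 3: deduplicate by the business key, keeping the first copy of each email.
clean = clean.drop_duplicates(subset=["email"])
```

Each pass assigns its result back to clean, so the steps build on one another while raw stays untouched.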

Lesson complete — your tables are tidy!

You can drop rows by label or condition, drop columns with drop(columns=...), flag repeats with duplicated(), and collapse them with drop_duplicates(subset=, keep=).

🚀 Up next: Renaming & Relabelling — give your columns and index clean, consistent names.

Practice quiz

What does df.drop(index=[1]) remove?

  • The column named 1
  • The row whose index label is 1
  • The first value of every column
  • Nothing, it errors

Answer: The row whose index label is 1. drop(index=[1]) removes the row labelled 1.

How do you remove a whole column with drop?

  • df.drop('col')
  • df.drop(columns=['col'])

Answer: df.drop(columns=['col']). Use df.drop(columns=[...]) to drop a column by name.

What is the pandas idiom for removing rows that meet a condition?

  • Keep the opposite rows with a boolean mask
  • Loop and call drop on each
  • Use df.delete(condition)
  • Set the rows to None

Answer: Keep the opposite rows with a boolean mask. Conditional removal is done by keeping the rows you want, e.g. df[df['age'] >= 18].

What does df.duplicated() return?

  • The deduplicated DataFrame
  • A count of duplicates
  • Only the unique rows
  • A boolean Series flagging repeat rows

Answer: A boolean Series flagging repeat rows. duplicated() returns True for each row that repeats an earlier one.

For a df whose only column, email, holds ['a','b','a','c'], what does df.duplicated().sum() return?

  • 1
  • 2
  • 0
  • 3

Answer: 1. Only the second 'a' is a repeat, so the sum is 1.

What does keep=False do in drop_duplicates()?

  • Keeps only the first copy
  • Keeps only the last copy
  • Drops every row that has any duplicate
  • Keeps all rows unchanged

Answer: Drops every row that has any duplicate. keep=False removes all rows that are part of any duplicate set, leaving only truly unique rows.

What does the subset= argument of drop_duplicates control?

  • The output column order
  • Which columns define a duplicate
  • How many rows to keep
  • The sort order

Answer: Which columns define a duplicate. subset=['email'] judges duplicates by that column alone, like a business key.

Why might drop_duplicates appear to 'not work'?

  • It only works on numbers
  • The result was not assigned back
  • It needs an internet connection
  • Duplicates cannot be removed

Answer: The result was not assigned back. drop_duplicates() returns a new DataFrame; reassign it (df = df.drop_duplicates()) or the change is lost.

What is the default value of keep in drop_duplicates()?

  • 'last'
  • False
  • None
  • 'first'

Answer: 'first'. The default keep='first' retains the earliest occurrence of each duplicate.

To reliably keep the newest record per id, you should first:

  • Sort by the update column, then drop_duplicates with keep='last'
  • Call drop_duplicates twice
  • Use keep=False
  • Reset the index

Answer: Sort by the update column, then drop_duplicates with keep='last'. "first" and "last" are judged by row order, not by any date column, so sort by the update column (e.g. 'updated') and then deduplicate on the id with keep='last'.
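
The sort-then-deduplicate pattern from this last answer can be sketched as follows. The id, status, and updated columns are assumed names, and the rows are deliberately stored out of date order to show why the sort matters.

```python
import pandas as pd

# Hypothetical records; column names and values are assumed.
df = pd.DataFrame({
    "id": [1, 2, 1, 2],
    "status": ["shipped", "new", "new", "paid"],
    "updated": pd.to_datetime(
        ["2024-03-05", "2024-03-01", "2024-03-02", "2024-03-04"]
    ),
})

# Without sorting, "last" means last in row order, which is stale for id 1.
unsorted_last = df.drop_duplicates(subset=["id"], keep="last")

# Sort by the update column first, so the last row per id is the newest one.
latest = (
    df.sort_values("updated")
      .drop_duplicates(subset=["id"], keep="last")
)
```

In this sample, unsorted_last keeps the older "new" row for id 1, while latest keeps its "shipped" row.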