Checkpoint: Analysis & Aggregation

This checkpoint is a hands-on review that combines every analysis skill from this section — binning, encoding, windows, group transforms, time-series math, and chaining — into one build challenge and a short self-test quiz.

Learn Checkpoint: Analysis & Aggregation in our free Pandas course — a beginner-friendly interactive lesson with worked examples, a practice exercise and a…

Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

Work through the multi-step challenge first, check your answer against the full solution, then test your recall with the quiz before moving on to the capstone.

Before the challenge, here is the mental map of when to reach for each tool you have met:

A quick three-skill warm-up: bin a column with cut , add a rolling average, and broadcast a group mean with transform . Run it and read each new column.

You are given a daily sales table per region . Build a single analysis that does all of the following, in order:

Try it in the editor first. The full reference solution is hidden just below — open it only after you have attempted the build.

agg shrinks the frame and misaligns on assignment:

Answer each in your head, then expand to check.

pd.qcut . It splits on quantiles so every bin is equally full, unlike pd.cut which uses fixed value edges.

dtype=int . And drop_first=True removes the redundant first indicator column.

A full window of 3 is not available yet for the first two rows. Pass min_periods=1 to compute with partial windows instead.

transform returns the same shape as the input (one value per row, broadcast within each group), while agg collapses each group to a single summary row.

diff() is the absolute change in units; pct_change() is the relative change as a fraction (a 0.25 means +25%).

It returns all rows of the groups that pass the test. The function receives each group as a sub-DataFrame and must return a single boolean (True keeps the whole group).

Checkpoint cleared — you can analyse like a pro!

You combined binning, rolling windows, and group transforms into one pipeline, summarised by group, and tested your recall. That is the full analysis toolkit working together.

🚀 Up next: Capstone — End-to-End Data Cleaning — apply everything to a complete, realistic dataset.

Practice quiz

Which function gives bins that each hold roughly the same number of rows?

pd.cut
pd.get_dummies
pd.qcut
df.rolling

Answer: pd.qcut. pd.qcut splits on quantiles so every bin is equally full. pd.cut uses fixed value edges, which can leave some bins nearly empty.

What argument makes pd.get_dummies return 0/1 integers instead of True/False?

dtype=int
drop_first=True
sparse=True
prefix='x'

Answer: dtype=int. dtype=int makes the indicator columns 0/1 integers. drop_first=True is a separate option that removes the redundant first indicator column.

Why are the first rows of a rolling(3).mean() NaN, and how can you fill them?

rolling always discards the first row
The data must be sorted first
You forgot to call .reset_index()
A full window of 3 is not available yet; pass min_periods=1 for partial windows

Answer: A full window of 3 is not available yet; pass min_periods=1 for partial windows. The first two rows do not have a full window of 3 values yet. Passing min_periods=1 lets pandas compute with partial windows instead of NaN.

How does groupby().transform differ from groupby().agg?

transform collapses each group to one row; agg keeps every row
transform returns the same shape as the input (one value per row); agg collapses each group to a single summary
They are identical
transform only works on strings

Answer: transform returns the same shape as the input (one value per row); agg collapses each group to a single summary. transform broadcasts a group statistic back onto every row, keeping the row count. agg collapses each group to a single summary row.

What is the difference between diff() and pct_change()?

diff is the absolute change in units; pct_change is the relative change as a fraction
diff is relative; pct_change is absolute
They both return percentages
diff only works on dates

Answer: diff is the absolute change in units; pct_change is the relative change as a fraction. diff() gives the absolute change in units, while pct_change() gives the relative change as a fraction, so 0.25 means a 25% increase.

What does groupby().filter() return, and what must its function output?

A single summary row; the function returns a number
Only the first row of each group; the function returns a string
All rows of the groups that pass the test; the function returns one boolean per group
A reshaped pivot table; the function returns a list

Answer: All rows of the groups that pass the test; the function returns one boolean per group. filter returns every row of the groups that pass. Its function receives each group as a sub-DataFrame and must return a single boolean — True keeps the whole group.

When you assign df['z'] = df.groupby('region')['sales'].mean(), what goes wrong?

Nothing, it works perfectly
agg shrinks the frame so the result misaligns by row; use transform instead
mean is not a valid method
It silently drops the region column

Answer: agg shrinks the frame so the result misaligns by row; use transform instead. agg/mean returns one value per group, not per row, so direct assignment misaligns. Use groupby('region')['sales'].transform('mean') to keep the row count.

Why pass observed=True when you groupby a categorical produced by pd.cut?

It sorts the groups alphabetically
It converts the result to integers
It is required for rolling windows
It keeps only the bins that actually appear, avoiding empty groups and warnings

Answer: It keeps only the bins that actually appear, avoiding empty groups and warnings. Grouping a categorical otherwise keeps every possible category, including empty bins. observed=True keeps only the bins that actually appear in the data.

Which window method weights recent observations more heavily than older ones?

.rolling
.ewm
.expanding
.cumsum

Answer: .ewm. .ewm computes an exponentially weighted mean, giving more weight to recent values. rolling treats every value in the window equally and expanding grows the window from the start.

What does df['sales'].shift(1) produce?

The cumulative sum of sales
The sales values sorted ascending
The sales values moved down one row, so each row sees the previous value
The percentage change of sales

Answer: The sales values moved down one row, so each row sees the previous value. shift(1) moves the column down by one row, so each row lines up with the previous period's value — the building block for diff and pct_change.