Checkpoint: Data Analysis
This checkpoint consolidates the data-analysis half of the course — tibbles, reshaping, purrr, lubridate, ggplot2, distributions, hypothesis testing, and glm — into one end-to-end tidyverse pipeline and a self-check quiz.
Part of the free R course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
You'll recap each tool, tackle a build challenge that flows raw data through pivoting, grouped summaries, and a described plot, then test your understanding with a short checkpoint quiz before moving on to linear models.
What This Checkpoint Covers
📚 What You've Learned
Before the challenge, here's the toolkit you've built across the data-analysis lessons — each row is a job and the verb that does it.
🛠️ Build Challenge: A Tidyverse Pipeline
Put it together. Start from a wide inline tibble, reshape it long, summarise per group, and describe the plot you'd draw. Work through the # TODOs yourself before revealing the solution.
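If you want something to compare your attempt against, here is a minimal sketch of one possible shape for that pipeline. The store names, quarter columns, and revenue figures are made up for illustration and are not the lesson's own data.

```r
library(tibble)
library(tidyr)
library(dplyr)
library(ggplot2)

# Hypothetical wide data: one row per store, one column per quarter
sales_wide <- tribble(
  ~store,  ~Q1, ~Q2, ~Q3, ~Q4,
  "North", 120, 135, 150, 160,
  "South",  90, 110, 105, 130,
  "East",  200, 180, 210, 220
)

# Reshape long: old headers go to `quarter`, cell values go to `revenue`
sales_long <- sales_wide %>%
  pivot_longer(cols = -store, names_to = "quarter", values_to = "revenue")

# Grouped summary: one row per quarter
by_quarter <- sales_long %>%
  group_by(quarter) %>%
  summarise(
    total        = sum(revenue),
    mean_revenue = mean(revenue),
    .groups      = "drop"
  )

# The plot you would describe: one line per store across quarters
p <- ggplot(sales_long, aes(x = quarter, y = revenue,
                            colour = store, group = store)) +
  geom_line() +
  geom_point() +
  labs(title = "Revenue by quarter", x = "Quarter", y = "Revenue")
```

Printing sales_long, by_quarter, and p one at a time is an easy way to inspect each intermediate result.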
Stretch the build challenge further to weave in more of the course:
Build it in RStudio and inspect each intermediate result.
📝 Checkpoint Quiz
Answer each in your head (or in the console), then check it against the answer given. If any feel shaky, revisit that lesson before continuing.
pivot_longer(). The old column HEADERS go into the new column named by names_to (e.g. "year"), and the CELL values go into the column named by values_to (e.g. "revenue"). The reverse is pivot_wider() with names_from / values_from.
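As a rough illustration of that round trip, the sketch below pivots a small table long and then wide again. The region names and figures are hypothetical.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: the year is trapped in the column headers
wide <- tibble(
  region = c("EU", "US"),
  `2022` = c(10, 20),
  `2023` = c(12, 25)
)

# Headers ("2022", "2023") land in `year`; cell values land in `revenue`
long <- pivot_longer(wide, cols = -region,
                     names_to = "year", values_to = "revenue")

# The reverse: values of `year` become headers again
back <- pivot_wider(long, names_from = year, values_from = revenue)
```

Note that year comes out as a character column, because it was built from column names; convert it with as.integer() if you need to treat it as a number.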
map() always returns a LIST, one element per input. map_dbl() returns a numeric VECTOR and errors if any result isn't a single number (length 1 and coercible to double), which acts as a built-in type check. Use the typed variant whenever you expect a simple atomic result, because it catches mistakes immediately.
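A small sketch of the difference, using a made-up named list of numeric vectors:

```r
library(purrr)

# Hypothetical input: three numeric vectors of different lengths
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

as_list <- map(samples, mean)      # a named list with three elements
as_dbl  <- map_dbl(samples, mean)  # a named double vector of length three

# range() returns two values per input, which breaks the
# "one number per element" contract, so map_dbl() stops with an error.
range_attempt <- tryCatch(
  map_dbl(samples, range),
  error = function(e) "error"
)
```

Here range_attempt ends up holding the string "error", which is the type check doing its job; map(samples, range) would have quietly returned a list of length-two vectors instead.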
p answers "fraction below" — pnorm(60, mean, sd) gives the cumulative probability up to 60. q is its inverse and gives a value at a percentile — qnorm(0.90, mean, sd) returns the 90th-percentile value. d is the density (curve height) and r draws random numbers.
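To see the four prefixes side by side, here is a short sketch for a normal distribution. The mean of 50 and standard deviation of 10 are assumed values chosen only for the example.

```r
# Assumed parameters for illustration
mu    <- 50
sigma <- 10

below_60 <- pnorm(60, mean = mu, sd = sigma)    # fraction below 60, about 0.84
p90      <- qnorm(0.90, mean = mu, sd = sigma)  # 90th-percentile value, about 62.8
peak     <- dnorm(mu, mean = mu, sd = sigma)    # curve height at the mean

set.seed(42)                                    # make the random draws repeatable
draws <- rnorm(5, mean = mu, sd = sigma)        # five random values

# q is the inverse of p: feeding the percentile value back recovers 0.90
round_trip <- pnorm(p90, mean = mu, sd = sigma)
```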
chisq.test() on a contingency table of counts. A t-test compares MEANS of a numeric variable; here the data are COUNTS of a categorical outcome, so chi-squared (testing association between layout and sign-up) is the right tool. For small, sparse tables, fisher.test() is even better.
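A minimal sketch of that test, assuming two layouts labelled A and B and invented sign-up counts:

```r
# Hypothetical counts: rows are page layouts, columns are the outcome
signups <- matrix(
  c(45, 155,
    70, 130),
  nrow = 2, byrow = TRUE,
  dimnames = list(layout = c("A", "B"), signed_up = c("yes", "no"))
)

result <- chisq.test(signups)  # applies Yates' continuity correction on 2x2 tables

result$expected  # counts expected if layout and sign-up were independent
result$p.value   # a small p-value suggests an association

# Exact alternative for small or sparse tables
exact <- fisher.test(signups)
exact$p.value
```

Checking result$expected is a quick way to judge whether the table is sparse: a common rule of thumb is to prefer fisher.test() when any expected count is below about 5.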
Coefficients are log-odds, so apply exp(coef(model)) to read them as ODDS RATIOS (above 1 raises the odds, below 1 lowers them). For a probability on a new case, use predict(model, newdata, type = "response") — without type = "response" you'd get log-odds instead of a [0, 1] probability.
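The sketch below walks through both steps on simulated data. The variable names (hours, passed) and the simulation settings are assumptions made for the example, not part of the lesson.

```r
set.seed(1)

# Simulated data: pass probability rises with hours studied
n     <- 200
hours <- runif(n, min = 0, max = 10)
passed <- rbinom(n, size = 1, prob = plogis(-2 + 0.5 * hours))
study <- data.frame(hours = hours, passed = passed)

model <- glm(passed ~ hours, data = study, family = binomial)

# Coefficients are log-odds; exponentiate to read them as odds ratios
odds_ratios <- exp(coef(model))

# Prediction for a new case with 6 hours of study
new_case <- data.frame(hours = 6)
log_odds <- predict(model, newdata = new_case)                     # default scale: log-odds
prob     <- predict(model, newdata = new_case, type = "response")  # probability in [0, 1]
```

The two predictions describe the same fitted value on different scales: plogis(log_odds) gives the same number as prob.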
Both expect tidy data — one variable per column. If a variable (like "quarter") is spread across column headers, you can't map it to an aesthetic or group by it. Pivoting it into a real column makes it addressable, so aes(x = quarter) and group_by(quarter) work.
Practice quiz
Which function turns a WIDE table (one column per quarter) into LONG form?
- pivot_wider()
- spread()
- pivot_longer()
- gather_cols()
Answer: pivot_longer(). pivot_longer() collapses many columns into two: a names column and a values column. pivot_wider() does the reverse.
In pivot_longer(cols = -store, names_to = 'quarter', values_to = 'revenue'), where do the OLD column headers (Q1, Q2...) end up?
- In the 'quarter' column
- In the 'revenue' column
- They are dropped
- In the row names
Answer: In the 'quarter' column. names_to names the column that receives the old headers; values_to names the column that receives the cell values.
What does purrr's map() always return?
- A numeric vector
- A data frame
- A single scalar
- A list, one element per input
Answer: A list, one element per input. map() always returns a list. map_dbl() returns a double vector and errors if any result is not a single double.
Why use map_dbl() instead of map() when you expect numbers?
- It is the only one that works on vectors
- It returns a numeric vector and type-checks each result
- It runs in parallel
- It sorts the output
Answer: It returns a numeric vector and type-checks each result. map_dbl() returns an atomic double vector and errors immediately if a result is not a single double, catching mistakes early.
In the d/p/q/r distribution system, which prefix answers 'what fraction of values fall below 60?'
- p (cumulative probability)
- d (density)
- q (quantile)
- r (random)
Answer: p (cumulative probability). pnorm(60, mean, sd) gives the cumulative probability up to 60. q is its inverse (value at a percentile), d is density, r draws random numbers.
Which d/p/q/r prefix returns the 90th-percentile VALUE of a distribution?
- pnorm
- dnorm
- qnorm
- rnorm
Answer: qnorm. qnorm(0.90, mean, sd) returns the value at the 90th percentile. It is the inverse of pnorm.
Your outcome is 'signed up: yes/no' counted across two page layouts. Which test fits?
- t.test()
- chisq.test()
- wilcox.test()
- rnorm()
Answer: chisq.test(). The data are counts of a categorical outcome, so chisq.test() on a contingency table tests association. A t-test compares numeric means.
After glm(y ~ x, family = binomial), how do you turn a coefficient into an odds ratio?
- log(coef(model))
- 1 / coef(model)
- coef(model)^2
- exp(coef(model))
Answer: exp(coef(model)). glm coefficients are log-odds, so exp() converts them to odds ratios (above 1 raises the odds, below 1 lowers them).
To get a predicted PROBABILITY (in [0,1]) from a binomial glm, which call do you use?
- predict(model, newdata)
- predict(model, newdata, type = 'response')
- summary(model)
- coef(model)
Answer: predict(model, newdata, type = 'response'). Without type = 'response' predict() returns log-odds; type = 'response' maps them to a probability between 0 and 1.
Why pivot to LONG form before calling ggplot() or group_by()?
- Long form uses less memory
- Wide form cannot be plotted at all
- They expect tidy data with one variable per column
- It is required for fread()
Answer: They expect tidy data with one variable per column. Both need tidy data. A variable trapped in column headers (like 'quarter') must become a real column before you can map it to an aesthetic or group by it.