Checkpoint: Data Analysis

This checkpoint consolidates the data-analysis half of the course — tibbles, reshaping, purrr, lubridate, ggplot2, distributions, hypothesis testing, and glm — into one end-to-end tidyverse pipeline and a self-check quiz.

Part of the free R course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

You'll recap each tool, tackle a build challenge that flows raw data through pivoting, grouped summaries, and a described plot, then test your understanding with a short checkpoint quiz before moving on to linear models.

What This Checkpoint Covers

📚 What You've Learned

Before the challenge, here's the toolkit you've built across the data-analysis lessons — each row is a job and the verb that does it.

🛠️ Build Challenge: A Tidyverse Pipeline

Put it together. Start from a wide inline tibble, reshape it long, summarise per group, and describe the plot you'd draw. Work through the # TODOs yourself before revealing the solution.
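One way the challenge could look is sketched below. It is not the lesson's own solution: the store names, revenue figures, and variable names are made up for illustration, and it assumes the tibble, tidyr, and dplyr packages are installed.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical wide data: one row per store, one column per quarter.
sales_wide <- tribble(
  ~store,  ~Q1, ~Q2, ~Q3, ~Q4,
  "North", 120, 135, 150, 160,
  "South",  90, 110, 105, 130,
  "East",  200, 190, 210, 220
)

# Reshape long: old headers go to `quarter`, cell values go to `revenue`.
sales_long <- sales_wide %>%
  pivot_longer(cols = -store, names_to = "quarter", values_to = "revenue")

# Grouped summary: one row per store, largest total first.
store_summary <- sales_long %>%
  group_by(store) %>%
  summarise(
    total_revenue = sum(revenue),
    mean_revenue  = mean(revenue),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

print(store_summary)

# Plot description: a line chart built from sales_long with quarter on the
# x-axis, revenue on the y-axis, and one coloured line per store.
```

Printing sales_long between steps is a quick way to confirm that the 3 x 5 wide table became a 12 x 3 long one before you summarise it.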

Stretch the build challenge further to weave in more of the course:

Build it in RStudio and inspect each intermediate result.

📝 Checkpoint Quiz

Answer each in your head (or in the console), then click to reveal. If any feel shaky, revisit that lesson before continuing.

pivot_longer(). The old column HEADERS go into the new column named by names_to (e.g. "year"), and the CELL values go into the column named by values_to (e.g. "revenue"). The reverse is pivot_wider() with names_from / values_from.
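A small round trip makes the names_to / values_to pairing concrete. This is a sketch on invented data (the country labels and figures are placeholders), assuming tibble and tidyr are installed.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: the years live in the column headers.
wide <- tibble(
  country = c("A", "B"),
  `2022`  = c(10, 20),
  `2023`  = c(12, 25)
)

# Headers ("2022", "2023") land in `year`; cell values land in `revenue`.
long <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "revenue")

# pivot_wider() reverses it: names_from rebuilds the headers,
# values_from refills the cells.
back <- pivot_wider(long, names_from = year, values_from = revenue)

print(long)
print(back)
```

Note that year arrives as a character column because it came from header text; convert it (for example with as.integer()) if you need to do arithmetic on it.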

map() always returns a LIST, one element per input. map_dbl() returns a numeric VECTOR and errors if any result isn't a single numeric value (length 1, coercible to double), which acts as a built-in type check. Use the typed variant whenever you expect a simple atomic result, because it catches mistakes immediately.
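The difference is easy to see side by side. The sketch below uses a made-up list called samples and assumes purrr is installed; range() is used only because it returns two values and so trips the length check.

```r
library(purrr)

# Hypothetical input: three numeric vectors of different lengths.
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

means_list <- map(samples, mean)      # a list with elements a, b, c
means_vec  <- map_dbl(samples, mean)  # a named double vector: 2, 15, 5

# range() returns two numbers per input, so map_dbl() refuses the result.
bad <- tryCatch(
  map_dbl(samples, range),
  error = function(e) "error"
)

str(means_list)
print(means_vec)
print(bad)
```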

p answers "fraction below" — pnorm(60, mean, sd) gives the cumulative probability up to 60. q is its inverse and gives a value at a percentile — qnorm(0.90, mean, sd) returns the 90th-percentile value. d is the density (curve height) and r draws random numbers.
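The four prefixes can be tried on a single normal distribution. The mean of 50 and standard deviation of 10 below are assumed values chosen for illustration, not figures from the lesson; only base R is needed.

```r
# Hypothetical distribution: exam scores with mean 50 and sd 10.
mu    <- 50
sigma <- 10

# p: cumulative probability, the fraction of values below 60.
p_below_60 <- pnorm(60, mean = mu, sd = sigma)    # about 0.841

# q: the inverse of p, the value sitting at the 90th percentile.
value_90 <- qnorm(0.90, mean = mu, sd = sigma)    # about 62.8

# Feeding the q result back into p recovers the percentile.
round_trip <- pnorm(value_90, mean = mu, sd = sigma)

# d: density, the height of the curve at the mean.
height_at_mean <- dnorm(50, mean = mu, sd = sigma)

# r: random draws (seeded so the run is repeatable).
set.seed(1)
draws <- rnorm(5, mean = mu, sd = sigma)

print(c(p_below_60 = p_below_60, value_90 = value_90, round_trip = round_trip))
```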

chisq.test() on a contingency table of counts. A t-test compares MEANS of a numeric variable; here the data are COUNTS of a categorical outcome, so chi-squared (testing association between layout and sign-up) is the right tool. For small, sparse tables, where the chi-squared approximation becomes unreliable (a common rule of thumb is expected counts below 5), fisher.test() is the safer choice.
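As a minimal sketch of that test, the counts below are invented for two hypothetical layouts, A and B; only base R is used.

```r
# Hypothetical counts: rows are page layouts, columns are the outcome.
signups <- matrix(
  c(45, 155,
    70, 130),
  nrow = 2, byrow = TRUE,
  dimnames = list(layout = c("A", "B"), signed_up = c("yes", "no"))
)

# Chi-squared test of association between layout and sign-up.
result <- chisq.test(signups)

print(result$expected)  # expected counts under "no association"
print(result$p.value)

# Fisher's exact test on the same table, the fallback for sparse tables.
fisher_p <- fisher.test(signups)$p.value
print(fisher_p)
```

Checking result$expected before trusting the p-value is a cheap habit: if any expected count is small, that is the cue to report the fisher.test() result instead.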

Coefficients are log-odds, so apply exp(coef(model)) to read them as ODDS RATIOS (above 1 raises the odds, below 1 lowers them). For a probability on a new case, use predict(model, newdata, type = "response") — without type = "response" you'd get log-odds instead of a [0, 1] probability.
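Both steps can be sketched on simulated data. Everything here is an assumption made for the example: the variable names hours and passed, the sample size, and the true coefficients used to generate the outcome. Only base R is used.

```r
# Simulated data (hypothetical): study hours and a pass/fail outcome.
set.seed(42)
n      <- 200
hours  <- runif(n, min = 0, max = 10)
passed <- rbinom(n, size = 1, prob = plogis(-2 + 0.5 * hours))
study  <- data.frame(hours = hours, passed = passed)

model <- glm(passed ~ hours, data = study, family = binomial)

# Coefficients are on the log-odds scale; exp() turns them into odds ratios.
odds_ratios <- exp(coef(model))

# Prediction for a new case: the default scale is log-odds,
# type = "response" returns a probability.
new_case <- data.frame(hours = 6)
log_odds <- predict(model, newdata = new_case)
prob     <- predict(model, newdata = new_case, type = "response")

print(odds_ratios)
print(c(log_odds = unname(log_odds), prob = unname(prob)))
```

plogis(log_odds) gives the same number as prob, which is a handy check that you are reading the right scale.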

Both expect tidy data — one variable per column. If a variable (like "quarter") is spread across column headers, you can't map it to an aesthetic or group by it. Pivoting it into a real column makes it addressable, so aes(x = quarter) and group_by(quarter) work.
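To illustrate, here is a sketch in which quarter only becomes usable after pivoting. The data are invented, and the code assumes tibble, tidyr, dplyr, and ggplot2 are installed.

```r
library(tibble)
library(tidyr)
library(dplyr)
library(ggplot2)

# Hypothetical wide data: "quarter" exists only as column headers.
wide <- tibble(
  store = c("North", "South"),
  Q1 = c(120, 90),
  Q2 = c(135, 110),
  Q3 = c(150, 105)
)

# After pivoting, quarter is a real column that can be grouped and mapped.
long <- pivot_longer(wide, cols = -store,
                     names_to = "quarter", values_to = "revenue")

by_quarter <- long %>%
  group_by(quarter) %>%
  summarise(total = sum(revenue), .groups = "drop")

# The same column maps straight onto an aesthetic.
p <- ggplot(long, aes(x = quarter, y = revenue, colour = store, group = store)) +
  geom_line() +
  geom_point() +
  labs(x = "Quarter", y = "Revenue", colour = "Store")

print(by_quarter)
```

Calling print(p) in an interactive session draws the chart; the plot object is built here without rendering so the sketch runs anywhere.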

Practice quiz

Which function turns a WIDE table (one column per quarter) into LONG form?

  • pivot_wider()
  • spread()
  • pivot_longer()
  • gather_cols()

Answer: pivot_longer(). pivot_longer() collapses many columns into two: a names column and a values column. pivot_wider() does the reverse.

In pivot_longer(cols = -store, names_to = 'quarter', values_to = 'revenue'), where do the OLD column headers (Q1, Q2...) end up?

  • In the 'quarter' column
  • In the 'revenue' column
  • They are dropped
  • In the row names

Answer: In the 'quarter' column. names_to names the column that receives the old headers; values_to names the column that receives the cell values.

What does purrr's map() always return?

  • A numeric vector
  • A data frame
  • A single scalar
  • A list, one element per input

Answer: A list, one element per input. map() always returns a list. map_dbl() returns a double vector and errors if any result is not a single numeric value.

Why use map_dbl() instead of map() when you expect numbers?

  • It is the only one that works on vectors
  • It returns a numeric vector and type-checks each result
  • It runs in parallel
  • It sorts the output

Answer: It returns a numeric vector and type-checks each result. map_dbl() returns an atomic double vector and errors immediately if a result is not a single numeric value, catching mistakes early.

In the d/p/q/r distribution system, which prefix answers 'what fraction of values fall below 60?'

  • p (cumulative probability)
  • d (density)
  • q (quantile)
  • r (random)

Answer: p (cumulative probability). pnorm(60, mean, sd) gives the cumulative probability up to 60. q is its inverse (value at a percentile), d is density, r draws random numbers.

Which d/p/q/r prefix returns the 90th-percentile VALUE of a distribution?

  • pnorm
  • dnorm
  • qnorm
  • rnorm

Answer: qnorm. qnorm(0.90, mean, sd) returns the value at the 90th percentile. It is the inverse of pnorm.

Your outcome is 'signed up: yes/no' counted across two page layouts. Which test fits?

  • t.test()
  • chisq.test()
  • wilcox.test()
  • rnorm()

Answer: chisq.test(). The data are counts of a categorical outcome, so chisq.test() on a contingency table tests association. A t-test compares numeric means.

After glm(y ~ x, family = binomial), how do you turn a coefficient into an odds ratio?

  • log(coef(model))
  • 1 / coef(model)
  • coef(model)^2
  • exp(coef(model))

Answer: exp(coef(model)). glm coefficients are log-odds, so exp() converts them to odds ratios (above 1 raises the odds, below 1 lowers them).

To get a predicted PROBABILITY (in [0,1]) from a binomial glm, which call do you use?

  • predict(model, newdata)
  • predict(model, newdata, type = 'response')
  • summary(model)
  • coef(model)

Answer: predict(model, newdata, type = 'response'). Without type = 'response' predict() returns log-odds; type = 'response' maps them to a probability between 0 and 1.

Why pivot to LONG form before calling ggplot() or group_by()?

  • Long form uses less memory
  • Wide form cannot be plotted at all
  • They expect tidy data with one variable per column
  • It is required for fread()

Answer: They expect tidy data with one variable per column. Both need tidy data. A variable trapped in column headers (like 'quarter') must become a real column before you can map it to an aesthetic or group by it.