Capstone: A Data Analysis Report

A capstone is a project that ties everything together — here you'll run a complete data analysis in R, from importing and cleaning a dataset to summarising, modelling, and reporting the results.


Part of the free R course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

By the end of this lesson you'll have walked the full analysis pipeline: read.csv() to import, tidyr to clean, dplyr to aggregate, lm() to model, and a written summary to report, using every skill from the course in one flow.

What You'll Learn in This Lesson

1️⃣ Import and Clean

Every analysis starts by getting data in and tidying it. Here we read CSV text, drop a row with a missing value using drop_na(), and add a computed revenue column with mutate().
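As a minimal sketch of this step, assuming a small made-up dataset (the column names region, units and price and every value below are hypothetical, not the lesson's own data), the import-and-clean stage might look like this:

```r
library(dplyr)
library(tidyr)

# Hypothetical sales data, inlined as CSV text so no file is needed
csv_text <- "region,units,price
North,10,2.5
South,NA,3.0
East,8,4.0
North,12,2.5"

# read.csv() accepts a text = argument as an alternative to a file path
sales <- read.csv(text = csv_text)

clean <- sales %>%
  drop_na() %>%                      # removes the South row, whose units is NA
  mutate(revenue = units * price)    # new column, computed for every row

print(clean)
```

With a real file you would pass its path to read.csv() instead of text =. Note that drop_na() with no arguments drops a row if any column is missing; naming columns, for example drop_na(units), restricts the check to those columns.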

2️⃣ Aggregate and Summarise

With clean data, we answer the business question: revenue and average units per region. group_by() + summarise() + arrange() produce a tidy report.
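The aggregation step can be sketched as follows. The data frame is a hypothetical stand-in for the cleaned data, and the output column names total_revenue and avg_units are illustrative choices rather than the lesson's own.

```r
library(dplyr)

# Hypothetical cleaned data (values invented for illustration)
clean <- data.frame(
  region = c("North", "East", "North", "South"),
  units  = c(10, 8, 12, 5),
  price  = c(2.5, 4.0, 2.5, 3.0)
) %>%
  mutate(revenue = units * price)

report <- clean %>%
  group_by(region) %>%
  summarise(
    total_revenue = sum(revenue),   # total per region
    avg_units     = mean(units),    # average units per region
    .groups = "drop"                # return an ungrouped result
  ) %>%
  arrange(desc(total_revenue))      # highest revenue first

print(report)
```

Each region collapses to one row, so the result has as many rows as there are distinct regions.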

3️⃣ Model and Communicate

Finally we model the relationship between price and units with lm(), read the slope and R-squared, and (in RStudio) draw a ggplot2 chart to communicate the finding.
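A rough sketch of the modelling step, using invented price and units values (assumptions for illustration only). The plot is built only if ggplot2 happens to be installed, and it is displayed when you print p in an interactive session such as RStudio.

```r
# Hypothetical data: units sold at five price points
sales <- data.frame(
  price = c(2, 3, 4, 5, 6),
  units = c(20, 17, 15, 11, 9)
)

# Fit units as a linear function of price
fit <- lm(units ~ price, data = sales)

slope     <- coef(fit)[["price"]]      # change in units per one-unit price increase
r_squared <- summary(fit)$r.squared    # share of variance in units explained by price

cat("Slope:", round(slope, 2), "\n")
cat("R-squared:", round(r_squared, 3), "\n")

# Optional chart: scatter plot with the fitted line
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(sales, aes(x = price, y = units)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)
}
```

A negative slope here means units fall as price rises; R-squared close to 1 means the straight line describes these particular points well, not that price causes the change.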

Your turn. Fill in the # TODO blank, run it, and compare with the expected output.

Run the entire pipeline yourself, from raw text to a written conclusion. This is the whole course in one exercise — import, clean, aggregate, sort, and report.

📋 Quick Reference — Analysis Pipeline

Practice quiz

Which function reads a CSV file into a data frame in base R?

load.csv()
import.csv()
read.csv()
csv.read()

Answer: read.csv(). read.csv() reads comma-separated data into a data frame.

In dplyr, which pair produces a grouped summary?

group_by() then summarise()
filter() then arrange()
mutate() then select()
join() then rename()

Answer: group_by() then summarise(). group_by() defines groups and summarise() reduces each to summary rows.

What does mutate(revenue = units * price) do?

Removes the units column
Filters rows where units > price
Sorts by price
Adds a new revenue column computed per row

Answer: Adds a new revenue column computed per row. mutate() adds or changes columns; the calculation is vectorized, so it is applied to every row in one step.

Which dplyr verb sorts rows from highest to lowest revenue?

sort(revenue)
arrange(desc(revenue))
order(-revenue)
rank(revenue)

Answer: arrange(desc(revenue)). arrange(desc(...)) sorts in descending order in a dplyr pipeline.

What does sum(revenue) compute inside summarise()?

The total of the revenue values in each group
The number of rows
The average revenue
The largest revenue

Answer: The total of the revenue values in each group. sum() adds the values; n() would count rows instead.

Which function fits a simple linear model of units on price?

glm.fit(units, price)
model(units, price)
lm(units ~ price, data = data)
regress(units ~ price)

Answer: lm(units ~ price, data = data). lm() fits linear models using the y ~ x formula interface.

In a typical analysis, when should cleaning (handling NAs) happen?

After producing the final report
Before aggregating or modelling
Only inside the plot
Never, R handles it silently

Answer: Before aggregating or modelling. Clean first so totals and models aren't distorted by missing or bad values.
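To see why the order matters, here is a small base R illustration with made-up numbers: a single NA turns the whole average into NA until the missing value is handled.

```r
# Hypothetical units column with one missing value
units <- c(10, NA, 8)

raw_mean   <- mean(units)                 # NA: the missing value propagates
clean_mean <- mean(units[!is.na(units)])  # mean of the remaining values, 10 and 8

print(raw_mean)
print(clean_mean)
```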

What does n() return inside summarise()?

The sum of a column
The mean of a column
The number of columns
The number of rows in the current group

Answer: The number of rows in the current group. n() counts rows per group; useful alongside sum() and mean().

Why finish a grouped summary with .groups = "drop"?

To delete the data frame
So later steps operate on ungrouped data
To sort the result
To round the numbers

Answer: So later steps operate on ungrouped data. Dropping groups (or calling ungroup()) prevents surprises in later pipeline steps.
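One way to see the kind of surprise this avoids, sketched with hypothetical data: when you group by two columns, .groups = "drop_last" (which matches what summarise() does by default in this case) leaves the result grouped by the first column, so a later sum() is computed per region rather than over the whole table.

```r
library(dplyr)

# Hypothetical data: revenue by region and product
sales <- data.frame(
  region  = c("N", "N", "S", "S"),
  product = c("a", "b", "a", "b"),
  revenue = c(1, 2, 3, 4)
)

# Result is still grouped by region
kept <- sales %>%
  group_by(region, product) %>%
  summarise(total = sum(revenue), .groups = "drop_last") %>%
  mutate(share = total / sum(total))   # share within each region

# Result is ungrouped
dropped <- sales %>%
  group_by(region, product) %>%
  summarise(total = sum(revenue), .groups = "drop") %>%
  mutate(share = total / sum(total))   # share of the overall total

group_vars(kept)       # "region"
is_grouped_df(dropped) # FALSE
```

The same mutate() call gives different share values in the two results, which is easy to miss if you did not realise the data was still grouped.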

Why pull report numbers from code rather than typing them by hand?

To make the file larger
Because typing is disallowed in R
So the report can never drift out of sync with the data
To avoid using dplyr

Answer: So the report can never drift out of sync with the data. Generating numbers from live code keeps the report reproducible and accurate.
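A minimal sketch of the idea, assuming a hypothetical summary table called report: the sentence is built from the computed values, so re-running the code with new data updates the wording automatically.

```r
library(dplyr)

# Hypothetical summary table (values invented for illustration)
report <- data.frame(
  region        = c("East", "North"),
  total_revenue = c(32, 55),
  stringsAsFactors = FALSE
)

# Take the top row after sorting from highest to lowest revenue
top <- report %>%
  arrange(desc(total_revenue)) %>%
  slice(1)

# Build the conclusion from the data instead of typing the numbers
sentence <- sprintf(
  "%s had the highest revenue at %.2f.",
  top$region, top$total_revenue
)

print(sentence)
```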