Capstone: A Data Analysis Report
A capstone is a project that ties everything together — here you'll run a complete data analysis in R, from importing and cleaning a dataset to summarising, modelling, and reporting the results.
Part of the free R course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
By the end of this lesson you'll have walked the full analysis pipeline — read.csv and tidyr to clean, dplyr to aggregate, lm to model, and a written summary — using every skill from the course in one flow.
What You'll Learn in This Lesson
1️⃣ Import and Clean
Every analysis starts by getting data in and tidying it. Here we read CSV text, drop a row with a missing value using drop_na(), and add a computed revenue column with mutate().
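The import-and-clean step might look like the sketch below. The CSV text, its column names (region, units, price) and the values are invented for illustration and are not the lesson's own dataset; dplyr and tidyr are assumed to be installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical sales data, invented for illustration (not the lesson's dataset).
csv_text <- "region,units,price
North,10,2.5
South,4,3.0
North,6,2.0
East,NA,4.0
South,8,3.5"

# read.csv() accepts inline text through its `text` argument.
sales <- read.csv(text = csv_text)

clean <- sales %>%
  drop_na() %>%                      # removes the East row, whose units value is NA
  mutate(revenue = units * price)    # element-wise product stored in a new column

print(clean)
```

Four of the five rows survive, and each one carries its own revenue value.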
2️⃣ Aggregate and Summarise
With clean data, we answer the business question: revenue and average units per region. group_by() + summarise() + arrange() produce a tidy report.
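A minimal sketch of that grouped report, assuming a small already-cleaned table. The data and the output column names (total_revenue, avg_units) are hypothetical.

```r
library(dplyr)

# Hypothetical cleaned sales data, invented for illustration.
clean <- data.frame(
  region = c("North", "South", "North", "South"),
  units  = c(10, 4, 6, 8),
  price  = c(2.5, 3.0, 2.0, 3.5)
) %>%
  mutate(revenue = units * price)

report <- clean %>%
  group_by(region) %>%                  # one group per region
  summarise(
    total_revenue = sum(revenue),       # total revenue within each region
    avg_units     = mean(units),        # average units within each region
    .groups = "drop"                    # return an ungrouped result
  ) %>%
  arrange(desc(total_revenue))          # highest revenue first

print(report)
```

With this sample data South comes first (total revenue 40) and North second (37).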
3️⃣ Model and Communicate
Finally we model the relationship between price and units with lm(), read the slope and R-squared, and (in RStudio) draw a ggplot2 chart to communicate the finding.
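Reading the slope and R-squared from a fitted model could be sketched as follows. The price and units values are made up for illustration, and only base R is used; the ggplot2 chart is left out here.

```r
# Hypothetical price/units observations, invented for illustration.
sales <- data.frame(
  price = c(2.0, 2.5, 3.0, 3.5, 4.0),
  units = c(12, 10, 9, 6, 5)
)

# Fit units as a linear function of price.
fit <- lm(units ~ price, data = sales)

slope     <- unname(coef(fit)["price"])  # change in units per one-unit rise in price
r_squared <- summary(fit)$r.squared      # share of the variance in units explained by price

cat("slope:", round(slope, 2), "\n")
cat("R-squared:", round(r_squared, 3), "\n")
```

A negative slope in this sample means units fall as price rises; the closer R-squared is to 1, the more of that variation the straight line accounts for.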
Your turn. Fill in the # TODO blank, run it, and compare with the expected output.
Run the entire pipeline yourself, from raw text to a written conclusion. This is the whole course in one exercise — import, clean, aggregate, sort, and report.
📋 Quick Reference — Analysis Pipeline
Practice quiz
Which function reads a CSV file into a data frame in base R?
- load.csv()
- import.csv()
- read.csv()
- csv.read()
Answer: read.csv(). read.csv() reads comma-separated data into a data frame.
In dplyr, which pair produces a grouped summary?
- group_by() then summarise()
- filter() then arrange()
- mutate() then select()
- join() then rename()
Answer: group_by() then summarise(). group_by() defines groups and summarise() reduces each to summary rows.
What does mutate(revenue = units * price) do?
- Removes the units column
- Filters rows where units > price
- Sorts by price
- Adds a new revenue column computed per row
Answer: Adds a new revenue column computed per row. mutate() adds or changes columns; the calculation is vectorized, so each row gets its own units * price value.
Which dplyr verb sorts rows from highest to lowest revenue?
- sort(revenue)
- arrange(desc(revenue))
- order(-revenue)
- rank(revenue)
Answer: arrange(desc(revenue)). arrange(desc(...)) sorts in descending order in a dplyr pipeline.
What does sum(revenue) compute inside summarise()?
- The total of the revenue values in each group
- The number of rows
- The average revenue
- The largest revenue
Answer: The total of the revenue values in each group. sum() adds the values; n() would count rows instead.
Which function fits a simple linear model of units on price?
- glm.fit(units, price)
- model(units, price)
- lm(units ~ price, data = data)
- regress(units ~ price)
Answer: lm(units ~ price, data = data). lm() fits linear models using the y ~ x formula interface.
In a typical analysis, when should cleaning (handling NAs) happen?
- After producing the final report
- Before aggregating or modelling
- Only inside the plot
- Never, R handles it silently
Answer: Before aggregating or modelling. Clean first so totals and models aren't distorted by missing or bad values.
What does n() return inside summarise()?
- The sum of a column
- The mean of a column
- The number of columns
- The number of rows in the current group
Answer: The number of rows in the current group. n() counts rows per group; useful alongside sum() and mean().
Why finish a grouped summary with .groups = "drop"?
- To delete the data frame
- So later steps operate on ungrouped data
- To sort the result
- To round the numbers
Answer: So later steps operate on ungrouped data. Dropping groups (or calling ungroup() afterwards) prevents surprises in later pipeline steps.
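One way to see the kind of surprise this avoids is the sketch below. The data and the product column are hypothetical, and the example assumes dplyr 1.0.0 or later, where summarise() takes the .groups argument.

```r
library(dplyr)

# Hypothetical data with two grouping columns, invented for illustration.
sales <- data.frame(
  region  = c("North", "North", "South", "South"),
  product = c("A", "B", "A", "B"),
  revenue = c(25, 12, 12, 28)
)

# Default behaviour: only the last grouping column is dropped, so the
# result is still grouped by region.
still_grouped <- sales %>%
  group_by(region, product) %>%
  summarise(revenue = sum(revenue))

# With .groups = "drop" the result is an ordinary ungrouped tibble.
ungrouped <- sales %>%
  group_by(region, product) %>%
  summarise(revenue = sum(revenue), .groups = "drop")

# The same later step now means two different things:
share_grouped <- still_grouped %>% mutate(share = revenue / sum(revenue))  # share within each region
share_overall <- ungrouped %>% mutate(share = revenue / sum(revenue))      # share of the overall total

print(is_grouped_df(still_grouped))
print(is_grouped_df(ungrouped))
```

In the grouped version the shares add up to 1 inside each region; in the ungrouped version they add up to 1 across the whole table.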
Why pull report numbers from code rather than typing them by hand?
- To make the file larger
- Because typing is disallowed in R
- So the report can never drift out of sync with the data
- To avoid using dplyr
Answer: So the report can never drift out of sync with the data. Generating numbers from live code keeps the report reproducible and accurate.
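As a rough illustration, a conclusion sentence can be assembled from computed values rather than typed by hand. The summary table and the wording of the sentence are hypothetical.

```r
# Hypothetical summary table, invented for illustration.
report <- data.frame(
  region        = c("South", "North"),
  total_revenue = c(40, 37),
  stringsAsFactors = FALSE
)

# Pick the row with the largest total instead of hard-coding a region name.
top <- report[which.max(report$total_revenue), ]

conclusion <- sprintf(
  "%s had the highest revenue (%.2f) out of %d regions.",
  top$region, top$total_revenue, nrow(report)
)

cat(conclusion, "\n")
```

If the underlying data changes, rerunning the script updates the sentence along with the numbers.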