Checkpoint: Statistical Modeling

This checkpoint consolidates your advanced R skills — matrix algebra, stringr, ANOVA, time series, data.table, Shiny, R Markdown, closures, and debugging — into one practical statistical-modeling exercise that fits and interprets a linear model.


Part of the free R course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

You'll review everything you've built, tackle a multi-step modeling challenge with a worked solution, and test yourself with a short quiz.

What This Checkpoint Covers

📚 What You've Mastered

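Before the challenge, here's the toolkit you've assembled across the advanced track: matrix algebra, stringr, ANOVA, time series, data.table, Shiny, R Markdown, closures, and debugging. Each one is a skill you'll lean on when modeling real data.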

Work through the four steps on the inline study dataset: fit a linear model, read its summary, make a prediction, and compare the linear model against a quadratic one with anova(). Attempt it yourself first, then expand the solution to compare your approach.

A complete, commented solution. The exact numbers depend on the fit, so run it to see them; the comments explain how to interpret each piece.
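As a rough guide, the four steps could be sketched like this. The study data frame below is a made-up stand-in for the lesson's dataset (the values are an assumption for illustration), and the variable names other than fit, fit2, hours, and score are hypothetical.

```r
# Hypothetical stand-in for the lesson's `study` dataset (values are made up).
study <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(52, 55, 61, 64, 70, 74, 79, 85)
)

# Step 1: fit a straight-line model of score on hours.
fit <- lm(score ~ hours, data = study)

# Step 2: read the summary. The coefficient table is a matrix with the
# columns Estimate, Std. Error, t value, and Pr(>|t|).
fit_summary <- summary(fit)
coef_table  <- coef(fit_summary)
slope       <- coef_table["hours", "Estimate"]   # change in score per extra hour
slope_p     <- coef_table["hours", "Pr(>|t|)"]   # p-value for the slope
adj_r2      <- fit_summary$adj.r.squared

# Step 3: predict the score for someone who studies 9 hours.
pred_9 <- predict(fit, newdata = data.frame(hours = 9))

# Step 4: fit a quadratic model and test whether the extra term helps.
fit2       <- lm(score ~ hours + I(hours^2), data = study)
comparison <- anova(fit, fit2)
anova_p    <- comparison[["Pr(>F)"]][2]   # row 2 compares fit2 against fit

print(coef_table)
print(pred_9)
print(comparison)
```

If anova_p comes out large (say, above 0.05), the quadratic term is not earning its place and the simpler fit is the one to keep.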

Answer each in your head, then expand to check. No peeking first.

It's the slope: the average change in score for each one-unit increase in hours. If it's about 5, each extra hour of study is associated with roughly 5 more points.

Yes. At the usual 0.05 threshold, 0.002 is well below it, so the slope is statistically distinguishable from zero: the data show a detectable association between the predictor and the outcome.

Plain R-squared never decreases when you add predictors, even worthless ones. Adjusted R-squared penalizes extra terms, so it rises only when a new predictor improves the fit by more than chance alone would typically produce, which makes it fairer for comparing models of different sizes.
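To see the difference for yourself, here is a small sketch that adds a pure-noise predictor to a model. The study values and the noise column are made up for illustration, and fit_small and fit_big are hypothetical names.

```r
set.seed(42)  # make the random noise column reproducible

# Hypothetical stand-in for the lesson's `study` dataset (values are made up).
study <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(52, 55, 61, 64, 70, 74, 79, 85)
)
study$noise <- rnorm(nrow(study))  # random numbers unrelated to score

fit_small <- lm(score ~ hours, data = study)
fit_big   <- lm(score ~ hours + noise, data = study)

# Plain R-squared: cannot go down when a predictor is added.
r2 <- c(small = summary(fit_small)$r.squared,
        big   = summary(fit_big)$r.squared)

# Adjusted R-squared: may go up or down, depending on whether the
# new term improves the fit enough to offset the penalty.
adj_r2 <- c(small = summary(fit_small)$adj.r.squared,
            big   = summary(fit_big)$adj.r.squared)

print(rbind(r2, adj_r2))
```

Run it and compare the two rows: the plain R-squared for the bigger model is never lower, while the adjusted value is free to drop.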

Keep the simpler fit. A p-value of 0.40 means the extra terms in fit2 don't significantly improve the fit, so the added complexity isn't justified.

Use predict(fit, newdata = data.frame(hours = 9)). The newdata frame must use the same column names as the predictors in the model.
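The column-name rule is easy to trip over, so here is a minimal sketch of it in action. The study values are made up, and new_students, point_pred, interval_pred, and bad_result are hypothetical names. The failing call assumes no other object called hours exists in your session.

```r
# Hypothetical stand-in for the lesson's `study` dataset (values are made up).
study <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(52, 55, 61, 64, 70, 74, 79, 85)
)
fit <- lm(score ~ hours, data = study)

# Correct: the column is named `hours`, exactly as in the model formula.
new_students <- data.frame(hours = c(9, 10))
point_pred   <- predict(fit, newdata = new_students)

# Optional: ask for a prediction interval around each point prediction.
# The result is a matrix with columns fit, lwr, and upr.
interval_pred <- predict(fit, newdata = new_students, interval = "prediction")

# Wrong column name: predict() cannot find `hours` in newdata. With no
# `hours` object visible elsewhere, this raises an error, which we catch
# here and keep as a message instead of stopping the script.
bad_result <- tryCatch(
  predict(fit, newdata = data.frame(hrs = 9)),
  error = function(e) conditionMessage(e)
)

print(point_pred)
print(interval_pred)
print(bad_result)
```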

Call traceback() immediately to see the call stack and find which function actually failed, then drop a browser() or use debugonce() there to inspect the inputs.
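Here is one way that workflow might look. The helper functions fit_study() and run_analysis() are hypothetical, with a deliberate bug planted in the helper. Because traceback() and debugonce() are meant for an interactive session, they appear as comments, and the runnable part uses sys.calls() to record the same call stack from a script (a substitute technique, not what traceback() does internally).

```r
# Hypothetical stand-in for the lesson's `study` dataset (values are made up).
study <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(52, 55, 61, 64, 70, 74, 79, 85)
)

# Hypothetical helper with a bug: the filter removes every row,
# so lm() receives an empty data frame and fails.
fit_study <- function(df) {
  lm(score ~ hours, data = df[df$hours > 100, ])
}
run_analysis <- function(df) fit_study(df)

# In an interactive session you would do:
#   run_analysis(study)    # errors
#   traceback()            # prints the call stack, innermost call first
#   debugonce(fit_study)   # step through fit_study() on its next call
#   run_analysis(study)

# In a script: record the call stack at the moment the error is raised.
captured_calls <- NULL
result <- tryCatch(
  withCallingHandlers(
    run_analysis(study),
    error = function(e) captured_calls <<- sys.calls()
  ),
  error = function(e) conditionMessage(e)
)

# Keep the first line of each recorded call so the stack is easy to scan.
stack_text <- vapply(as.list(captured_calls),
                     function(cl) deparse(cl)[1], character(1))
print(result)
print(stack_text)
```

Scanning stack_text shows run_analysis() calling fit_study() calling lm(), which points you at the helper to inspect.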

Practice quiz

In lm(score ~ hours, data = study), what does the 'hours' coefficient (Estimate) represent?

The predicted score at hours = 0
The total variance explained
The average change in score per one extra hour
The p-value of the model

Answer: The average change in score per one extra hour. The coefficient is the slope: the average change in the outcome for each one-unit increase in that predictor.

A predictor's Pr(>|t|) is 0.002. At the usual 0.05 threshold, is its slope significant?

Yes, 0.002 is well below 0.05
No, 0.002 is too small to matter
Only if R-squared is high
It cannot be determined

Answer: Yes, 0.002 is well below 0.05. Because the p-value falls under the threshold, the slope is statistically distinguishable from zero.

Why prefer ADJUSTED R-squared over Multiple R-squared when comparing models of different sizes?

It is always larger
It ignores the intercept
It is faster to compute
It penalizes extra predictors, so it only rises for genuinely useful terms

Answer: It penalizes extra predictors, so it only rises for genuinely useful terms. Plain R-squared never decreases when you add predictors, even useless ones. Adjusted R-squared penalizes extra terms.

anova(fit, fit2) gives Pr(>F) = 0.40. Which model do you keep?

The more complex fit2
The simpler fit
Neither, refit both
Whichever has higher R-squared

Answer: The simpler fit. A p-value of 0.40 means the extra terms in fit2 do not significantly improve the fit, so keep the simpler model.

How do you predict the outcome for 9 hours of study with a fitted model 'fit'?

predict(fit, newdata = data.frame(hours = 9))
fit(9)
summary(fit, 9)
fit$predict(9)

Answer: predict(fit, newdata = data.frame(hours = 9)). Use predict() with a newdata frame whose column names match the model's predictors.

Your lm() errors deep inside a helper function. What is the best first debugging move?

Rewrite the whole script
Increase the data size
Call traceback() to see the call stack
Remove the intercept

Answer: Call traceback() to see the call stack. traceback() shows the call stack so you can find which function actually failed, then use browser() or debugonce() there.

How do aov() and lm() relate?

They fit completely different models
They fit the same linear model but present results differently
aov() only works on time series
lm() cannot handle categorical predictors

Answer: They fit the same linear model but present results differently. Under the hood they fit the same linear model; lm() is framed for regression coefficients, aov() for an ANOVA table with an F-test.
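You can check this relationship yourself. The sketch below uses R's built-in PlantGrowth dataset (chosen here only because it has a categorical predictor; it is not part of this lesson's data), and the variable names are hypothetical.

```r
# PlantGrowth ships with R: plant weight for three treatment groups.
fit_lm  <- lm(weight ~ group, data = PlantGrowth)
fit_aov <- aov(weight ~ group, data = PlantGrowth)

# Same underlying model: the estimated coefficients agree.
same_coefs <- isTRUE(all.equal(coef(fit_lm), coef(fit_aov)))

# Same overall F-test, reached through two different presentations.
f_from_lm  <- summary(fit_lm)$fstatistic[["value"]]
f_from_aov <- summary(fit_aov)[[1]][["F value"]][1]

print(summary(fit_lm))   # regression view: coefficient table
print(summary(fit_aov))  # ANOVA view: sums of squares and an F-test
```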

Which stringr function tests whether a pattern is PRESENT in each string?

str_replace()
str_split()
str_extract()
str_detect()

Answer: str_detect(). str_detect() returns a logical vector of whether the pattern matches. str_extract() pulls out the matched text.
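A quick side-by-side makes the difference concrete. This sketch assumes the stringr package is installed; the file names are made up for illustration.

```r
library(stringr)

# Hypothetical file names; two of them contain a four-digit year.
files <- c("report_2023.csv", "notes.txt", "data_2024.csv")

# str_detect(): is the pattern present? One TRUE/FALSE per string.
has_year <- str_detect(files, "\\d{4}")

# str_extract(): pull out the first match, or NA when there is none.
years <- str_extract(files, "\\d{4}")

print(has_year)
print(years)
```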

In data.table syntax dt[i, j, by], what does the 'by' argument do?

Filters rows
Groups rows for aggregation
Selects columns
Sorts the table

Answer: Groups rows for aggregation. i filters rows, j selects/computes columns, and by groups the rows so j is computed per group.
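Here is a small sketch that uses all three slots at once. It assumes the data.table package is installed; the table contents and the names dt and by_group are made up for illustration.

```r
library(data.table)

# Hypothetical study records for two groups of students.
dt <- data.table(
  group = c("a", "a", "b", "b", "b"),
  hours = c(2, 4, 6, 8, 10),
  score = c(55, 65, 70, 80, 90)
)

# i: keep rows with more than 2 hours
# j: compute the mean score and the row count
# by: do the j computation once per group
by_group <- dt[hours > 2, .(mean_score = mean(score), n = .N), by = group]

print(by_group)
```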

In a closure, what does the <<- operator do?

Creates a local variable
Defines a new function
Assigns to a variable in the enclosing environment, persisting private state
Compares two values

Answer: Assigns to a variable in the enclosing environment, persisting private state. <<- looks for an existing variable in the enclosing environments and assigns to it there instead of creating a new local one, which is how function factories keep persistent private state between calls.
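A counter is the classic illustration. In this minimal sketch, make_counter() and the variable names are hypothetical.

```r
# A function factory: each call creates a fresh private `count`.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # update `count` in the enclosing environment
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()

counter_a()
counter_a()
a_value <- counter_a()  # third call on counter_a
b_value <- counter_b()  # first call on counter_b, which has its own state

print(a_value)
print(b_value)
```

Each counter remembers its own count between calls, and neither one can see or change the other's.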