Cross-Validation & ROC

Cross-validation and ROC analysis answer the question that matters most in modelling: how well will this model do on data it has never seen?

Part of the free R course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

By the end of this lesson you'll split data and run k-fold CV and LOOCV, spot overfitting, and read ROC curves, AUC, sensitivity, and specificity to judge a classifier honestly.

What You'll Learn in This Lesson

1️⃣ Train/Test Split and Overfitting

Hold out part of the data, train on the rest, and compare errors. When the training error is far lower than the test error, the model is overfitting — memorizing noise instead of signal.
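
A minimal sketch of that comparison in base R. The simulated data, the 70/30 split, and the degree-10 polynomial are assumptions chosen for illustration, not the lesson's own dataset:

```r
set.seed(1)

# Simulated data (illustrative assumption): a straight line plus noise
n <- 60
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 3 + 0.8 * dat$x + rnorm(n, sd = 2)

# Hold out 30% of the rows as a test set
test_idx <- sample(n, size = round(0.3 * n))
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Root mean squared error of a fitted model on a given data frame
rmse <- function(fit, data) {
  sqrt(mean((data$y - predict(fit, newdata = data))^2))
}

# A simple model and a deliberately over-flexible one
simple   <- lm(y ~ x, data = train)
flexible <- lm(y ~ poly(x, 10), data = train)

errors <- data.frame(
  model      = c("linear", "degree-10 polynomial"),
  train_rmse = c(rmse(simple, train), rmse(flexible, train)),
  test_rmse  = c(rmse(simple, test),  rmse(flexible, test))
)
print(errors)
```

The polynomial contains the straight line as a special case, so its training error can never be higher. Whether its test error is worse depends on the draw, but a training error well below the test error is the overfitting signal to look for.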

2️⃣ k-Fold Cross-Validation

A single split wastes data and varies with luck. In caret, trainControl(method = "cv", number = 10) tells train() to rotate the validation role through 10 folds and average the result. method = "LOOCV" takes it to the extreme: one fold per observation.
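
To see what happens inside that rotation, here is a hand-rolled sketch in base R. The helper kfold_rmse() and the simulated data are hypothetical, written for this illustration; they are not caret functions, and caret's internals may differ in detail:

```r
set.seed(42)

# Simulated data (illustrative assumption)
n <- 100
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(n)

# Hypothetical helper: k-fold cross-validated RMSE for lm(y ~ x)
kfold_rmse <- function(data, k) {
  # Shuffle fold labels so each row lands in exactly one fold
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_rmse <- vapply(seq_len(k), function(i) {
    # Train on every fold except i, validate on fold i
    fit <- lm(y ~ x, data = data[fold_id != i, , drop = FALSE])
    held_out <- data[fold_id == i, , drop = FALSE]
    sqrt(mean((held_out$y - predict(fit, newdata = held_out))^2))
  }, numeric(1))
  list(fold_id = fold_id, fold_rmse = fold_rmse, cv_rmse = mean(fold_rmse))
}

cv10  <- kfold_rmse(dat, k = 10)         # 10-fold CV
loocv <- kfold_rmse(dat, k = nrow(dat))  # LOOCV: one fold per row

cat("10-fold CV RMSE:", cv10$cv_rmse, "\n")
cat("LOOCV RMSE:     ", loocv$cv_rmse, "\n")
```

Setting k to the number of rows turns the same loop into leave-one-out, at the cost of fitting the model once per observation.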

3️⃣ The ROC Curve and AUC

For classifiers, pROC::roc() sweeps the decision threshold and computes sensitivity and specificity at each one; plotting the result gives the ROC curve, sensitivity against 1 - specificity. auc() summarizes the whole curve in one number: 1.0 is perfect, 0.5 is chance.
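
The threshold sweep can be written out by hand in base R. This is a sketch on six made-up labels and scores (illustrative values, not lesson data), not a reproduction of pROC's implementation:

```r
# Made-up example: 1 = positive, 0 = negative
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.1, 0.3, 0.35, 0.6, 0.7, 0.9)

# Sweep thresholds from strictest (nothing predicted positive) to loosest
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))

# At each threshold, predict positive when score >= threshold
tpr <- vapply(thresholds, function(t) mean(scores[labels == 1] >= t), numeric(1))
fpr <- vapply(thresholds, function(t) mean(scores[labels == 0] >= t), numeric(1))

roc_points <- data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
print(roc_points)

# Area under the (fpr, tpr) curve by the trapezoid rule
auc_trapezoid <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc_trapezoid)  # 8/9, about 0.889
```

With pROC installed, pROC::auc(pROC::roc(labels, scores)) should report the same area for this data.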

4️⃣ Sensitivity & Specificity

At a fixed threshold, the confusion matrix gives Sensitivity (the true positive rate) and Specificity (the true negative rate). caret's confusionMatrix() reports both.
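
Both rates fall straight out of the four cells of the confusion matrix. A base R sketch, assuming made-up labels and scores and a threshold of 0.5 (all illustrative choices):

```r
# Made-up example: 1 = positive, 0 = negative
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.1, 0.3, 0.35, 0.6, 0.7, 0.9)

# Fix the threshold at 0.5 (an arbitrary choice for this sketch)
predicted <- as.integer(scores >= 0.5)

# Rows are predictions, columns are the truth; positive class listed first
cm <- table(predicted = factor(predicted, levels = c(1, 0)),
            actual    = factor(labels,    levels = c(1, 0)))
print(cm)

tp <- cm["1", "1"]; fn <- cm["0", "1"]
tn <- cm["0", "0"]; fp <- cm["1", "0"]

sensitivity <- tp / (tp + fn)  # fraction of actual positives caught: 2/3
specificity <- tn / (tn + fp)  # fraction of actual negatives cleared: 2/3
```

caret::confusionMatrix() expects factors for both the predictions and the reference, and treats the first factor level as the positive class unless the positive argument says otherwise, so it is worth checking which class it reports on.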

Your turn. Fill in the # TODO blank, run it, and compare with the expected output.

Build labels and scores, then compute the AUC with pROC and check it beats chance. AUC above 0.5 means the model ranks positives above negatives more often than not.
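
One way to approach the exercise, sketched with made-up labels and scores (illustrative values, not the lesson's originals). The base R part computes AUC as the share of positive/negative pairs the model ranks correctly; the pROC call only runs if the package is installed:

```r
# Made-up example: 1 = positive, 0 = negative
labels <- c(1, 0, 1, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.4, 0.8, 0.35, 0.5, 0.2, 0.7, 0.1)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Compare every positive score with every negative score
wins <- outer(pos, neg, ">")   # positive ranked above negative
ties <- outer(pos, neg, "==")  # ties count as half a win

auc_rank <- mean(wins + 0.5 * ties)
print(auc_rank)        # 14 of 16 pairs ranked correctly: 0.875
print(auc_rank > 0.5)  # TRUE: better than chance

# The same number via pROC, if it is available
if (requireNamespace("pROC", quietly = TRUE)) {
  roc_obj <- pROC::roc(labels, scores, quiet = TRUE)
  print(pROC::auc(roc_obj))
}
```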

Practice quiz

What does k-fold cross-validation do?

  • Splits data into k folds, training on k-1 and validating on the held-out fold, k times
  • Trains once on all data
  • Removes k columns
  • Picks k random rows to delete

Answer: Splits data into k folds, training on k-1 and validating on the held-out fold, k times. Each of the k folds serves as the validation set once while the rest train the model.

What is LOOCV?

  • A loss function
  • A plotting function
  • Leave-one-out cross-validation, where each single observation is the validation set
  • A type of join

Answer: Leave-one-out cross-validation, where each single observation is the validation set. LOOCV is k-fold CV with k equal to the number of observations; each row is held out once.

Why hold out a test set instead of evaluating on training data?

  • To make training faster
  • Training-set scores are optimistic and hide overfitting
  • It is required by R syntax
  • To save memory

Answer: Training-set scores are optimistic and hide overfitting. A model scored on its own training data looks better than it generalizes; a held-out set gives an honest estimate.

What is overfitting?

  • A model with no parameters
  • A model too simple to learn
  • Using too few predictors
  • A model that fits training noise and generalizes poorly

Answer: A model that fits training noise and generalizes poorly. Overfitting means the model captures noise specific to the training data and performs worse on new data.

What does the ROC curve plot?

  • True positive rate against false positive rate across thresholds
  • Loss vs iterations
  • Precision vs sample size
  • Accuracy vs epochs

Answer: True positive rate against false positive rate across thresholds. An ROC curve plots sensitivity (TPR) against 1 - specificity (FPR) over all thresholds.

What does AUC measure?

  • The number of folds
  • The area under the ROC curve, summarizing ranking ability
  • The training time
  • The number of predictors

Answer: The area under the ROC curve, summarizing ranking ability. AUC equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative; 1.0 is perfect, 0.5 is no better than chance.

Sensitivity is also known as which rate?

  • The false positive rate
  • The specificity
  • The precision
  • The true positive rate (recall)

Answer: The true positive rate (recall). Sensitivity, recall, and the true positive rate all mean the fraction of actual positives correctly identified.

Which function from the pROC package builds an ROC object?

  • rocplot()
  • make_roc()
  • roc()
  • auc.curve()

Answer: roc(). pROC::roc() takes the true labels and predicted scores and returns an ROC object.

How do you request 10-fold CV in caret's trainControl()?

  • trainControl(method = 'none')
  • trainControl(method = 'cv', number = 10)
  • trainControl(folds = 10)
  • trainControl(cv = TRUE)

Answer: trainControl(method = 'cv', number = 10). method = 'cv' with number = 10 tells caret to use 10-fold cross-validation.
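
A short sketch of that call in context, assuming caret is installed. The mtcars data and the mpg ~ wt + hp formula are illustrative choices, not part of the quiz:

```r
set.seed(123)

caret_available <- requireNamespace("caret", quietly = TRUE)

if (caret_available) {
  # Describe the resampling scheme: 10-fold cross-validation
  ctrl <- caret::trainControl(method = "cv", number = 10)

  # Hand the scheme to train(); method = "lm" fits an ordinary linear model
  fit <- caret::train(mpg ~ wt + hp, data = mtcars,
                      method = "lm", trControl = ctrl)

  # Cross-validated performance averaged over the 10 folds
  print(fit$results)
} else {
  message("caret is not installed; skipping the example")
}
```

trainControl() only describes the resampling plan; the folds are actually run when train() is called with it.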

Specificity measures what?

  • The fraction of actual negatives correctly identified
  • The fraction of positives found
  • The total accuracy
  • The area under the curve

Answer: The fraction of actual negatives correctly identified. Specificity is the true negative rate: the fraction of actual negatives correctly classified.