Machine Learning with caret

caret gives you a single, consistent interface to hundreds of machine-learning algorithms in R — handling data splitting, preprocessing, resampling, tuning, and evaluation through one workflow.

Part of the free R course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

By the end of this lesson you'll split data with createDataPartition(), fit and tune models with train() and trainControl(), and score them with confusionMatrix().

What You'll Learn in This Lesson

1️⃣ Splitting Data with createDataPartition()

Before modelling, hold out a test set the model never sees. createDataPartition() returns row indices for a stratified split, so each class keeps roughly the same proportion in the training and test sets.
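
As a minimal sketch of such a split, the example below uses the built-in iris data. The dataset, the 80/20 ratio, the seed, and the names train_idx, train_set, and test_set are assumptions for illustration, not part of the lesson text.

```r
library(caret)

set.seed(42)  # make the random split reproducible

# Stratified 80/20 split on the outcome column.
# list = FALSE returns the training row indices as a matrix instead of a list.
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

train_set <- iris[train_idx, ]   # rows used for fitting and tuning
test_set  <- iris[-train_idx, ]  # held-out rows, untouched until evaluation

# Compare class counts in the two sets
table(train_set$Species)
table(test_set$Species)
```

Because the sampling is done within each class, the species counts in the two tables should stay in the same ratio to each other.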

2️⃣ Fitting with train() and trainControl()

trainControl() specifies how to resample (for example, 5-fold cross-validation); train() fits the model and reports resampled accuracy. Passing preProcess = c('center', 'scale') to train() standardizes the predictors before fitting.
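
A minimal sketch of this step, assuming the iris data, an 80/20 split, and k-nearest neighbours (method = "knn") as the model. The object names ctrl and knn_fit are illustrative choices.

```r
library(caret)

set.seed(42)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]

# Resampling scheme: 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Fit k-nearest neighbours. Centering and scaling are estimated inside
# each resample, so the held-out fold does not leak into the preprocessing.
knn_fit <- train(
  Species ~ .,
  data       = train_set,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = ctrl
)

print(knn_fit)  # resampled Accuracy and Kappa for each candidate k
```

Distance-based methods such as kNN are sensitive to predictor scale, which is why standardizing is a sensible default here.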

3️⃣ Tuning Hyperparameters

A tuneGrid lists candidate hyperparameter values. train() fits every combination under the resampling scheme, stores the winning combination in $bestTune, and refits the final model with it on the full training set.
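
The sketch below tunes the single kNN hyperparameter k over a small grid. The candidate values, the iris data, and the names grid and knn_tuned are assumptions chosen for illustration.

```r
library(caret)

set.seed(42)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]

ctrl <- trainControl(method = "cv", number = 5)

# One row per candidate; the column name must match the model's
# tuning parameter, which is k for method = "knn"
grid <- expand.grid(k = c(3, 5, 7, 9, 11))

knn_tuned <- train(
  Species ~ .,
  data       = train_set,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = ctrl,
  tuneGrid   = grid
)

knn_tuned$results   # resampled metrics for every candidate k
knn_tuned$bestTune  # the k with the best resampled accuracy
```

For models with several hyperparameters, expand.grid() builds every combination of the supplied values, one per row.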

4️⃣ Evaluating with confusionMatrix()

predict() applies the trained model to the held-out test set, and confusionMatrix() reports accuracy, sensitivity, specificity, and kappa.
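
A minimal sketch of the evaluation step, again assuming the iris data and a kNN model. The names pred and cm are illustrative.

```r
library(caret)

set.seed(42)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

knn_fit <- train(
  Species ~ .,
  data       = train_set,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = trainControl(method = "cv", number = 5)
)

# Predicted class labels for the held-out rows
pred <- predict(knn_fit, newdata = test_set)

# Cross-tabulate predictions against the true labels
cm <- confusionMatrix(data = pred, reference = test_set$Species)

cm$table                                        # counts of predicted vs. true class
cm$overall[c("Accuracy", "Kappa")]              # overall metrics
cm$byClass[, c("Sensitivity", "Specificity")]   # one row per class
```

With more than two classes, sensitivity and specificity are reported per class in a one-versus-rest fashion, which is why cm$byClass is a matrix here.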

Your turn. Train a kNN and a tree on the same split, then compare their test accuracies. Swapping methods while keeping everything else fixed is exactly what caret makes painless.
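
One possible way to approach this comparison is sketched below. It is not the lesson's original exercise code: the iris data, the helper test_accuracy(), and the choice of method = "rpart" for the tree are assumptions (rpart relies on the rpart package, which ships with standard R installations).

```r
library(caret)

set.seed(42)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

ctrl <- trainControl(method = "cv", number = 5)

# Hypothetical helper: fit one caret method on the shared training set
# and return its accuracy on the shared test set
test_accuracy <- function(method) {
  set.seed(42)  # same seed so each method sees the same CV folds
  fit <- train(
    Species ~ .,
    data       = train_set,
    method     = method,
    preProcess = c("center", "scale"),
    trControl  = ctrl
  )
  pred <- predict(fit, newdata = test_set)
  unname(confusionMatrix(pred, test_set$Species)$overall["Accuracy"])
}

# Only the method string changes between the two fits
accuracies <- sapply(c(knn = "knn", tree = "rpart"), test_accuracy)
print(accuracies)
```

Everything except the method name is held fixed, so any difference in the two numbers comes from the model rather than from the split or the resampling setup.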

📋 Quick Reference — caret

Practice quiz

What is the main function in caret for fitting a model?

train()
model()
learn()
fit()

Answer: train(). train() is caret's unified interface for fitting and tuning hundreds of model types.

Which caret function sets resampling and control options for train()?

resample()
tuneControl()
trainControl()
controlSet()

Answer: trainControl(). trainControl() configures the resampling method, number of folds, and other control settings.

What does createDataPartition() do?

Creates new columns
Splits data into train/test while preserving outcome class proportions
Deletes missing rows
Centers and scales predictors

Answer: Splits data into train/test while preserving outcome class proportions. createDataPartition() makes a stratified train/test split, so each class keeps its share in both sets; it does not rebalance unequal classes.

Which preProcess steps standardize predictors in caret?

log and exp
sort and filter
merge and join
center and scale

Answer: center and scale. preProcess = c('center', 'scale') subtracts the mean and divides by the standard deviation.

What does the tuneGrid argument to train() control?

The candidate hyperparameter values to try
The number of CPU cores
The random seed only
The plotting theme

Answer: The candidate hyperparameter values to try. tuneGrid supplies a data frame of hyperparameter combinations caret evaluates during tuning.

What does confusionMatrix() report for a classifier?

A correlation matrix
Accuracy, sensitivity, specificity and related metrics
The model coefficients
Only the accuracy

Answer: Accuracy, sensitivity, specificity and related metrics. confusionMatrix() cross-tabulates predictions vs truth and computes accuracy, sensitivity, specificity, kappa, and more.

In trainControl(method = 'cv', number = 10), what does number specify?

The number of trees
The number of repeats
The number of predictors
The number of cross-validation folds

Answer: The number of cross-validation folds. With method = 'cv', number is the count of folds, so 10 means 10-fold cross-validation.

How do you generate predictions from a trained caret model?

output(model)
guess(model)
predict(model, newdata)
forecast(model)

Answer: predict(model, newdata). predict() applies the fitted model to newdata and returns class labels by default, or class probabilities with type = 'prob'.

Which of these is a modern alternative framework to caret?

lubridate
tidymodels
ggplot2
knitr

Answer: tidymodels. tidymodels, led by caret's original author, is the modern successor to caret; mlr3 is another current framework for modeling workflows in R.

Why does caret resample during training?

To estimate out-of-sample performance and pick hyperparameters
To remove all NA values
To convert factors to numbers
To shuffle column order

Answer: To estimate out-of-sample performance and pick hyperparameters. Resampling like cross-validation estimates how the model generalizes and guides hyperparameter selection.