Machine Learning with caret
caret gives you a single, consistent interface to hundreds of machine-learning algorithms in R — handling data splitting, preprocessing, resampling, tuning, and evaluation through one workflow.
Part of the free R course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
By the end of this lesson you'll split data with createDataPartition(), fit and tune models with train() and trainControl(), and score them with confusionMatrix().
What You'll Learn in This Lesson
1️⃣ Splitting Data with createDataPartition()
Before modelling, hold out a test set the model never sees. createDataPartition() makes a stratified split, so each class appears in roughly the same proportion in the training and test sets.
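As a minimal sketch of the split described above, the example below uses the built-in iris data. The dataset, the 80/20 ratio, the seed, and the names train_idx, train_set and test_set are all assumptions chosen for illustration.

```r
library(caret)

set.seed(42)  # makes the random split reproducible

# Stratified split on the outcome: p = 0.8 sends about 80% of each class
# to the training set; list = FALSE returns a matrix of row indices
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# Class counts in each part: the three species stay in equal proportion
table(train_set$Species)
table(test_set$Species)
```

Stratifying on the outcome matters most when classes are imbalanced, where a purely random split could leave a rare class almost absent from the test set.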
2️⃣ Fitting with train() and trainControl()
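trainControl() defines the resampling scheme, for example 5-fold cross-validation with method = 'cv' and number = 5; train() fits the model and reports resampled accuracy. Passing preProcess = c('center', 'scale') to train() standardizes the predictors before fitting.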
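A sketch of this step, assuming the iris data and a k-nearest-neighbours model (method = "knn"), neither of which is fixed by the lesson text. The object names ctrl and knn_fit are illustrative.

```r
library(caret)

set.seed(42)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]

# Resampling scheme: 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Fit kNN; predictors are centred and scaled as part of the fit
knn_fit <- train(
  Species ~ .,
  data       = train_set,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = ctrl
)

knn_fit  # prints resampled Accuracy and Kappa for each candidate k
```

Because preProcess is passed to train() rather than applied by hand, the centring and scaling statistics are estimated inside each resample, which avoids leaking information from the held-out fold.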
3️⃣ Tuning Hyperparameters
A tuneGrid lists candidate hyperparameter values. train() fits every combination under the resampling scheme and stores the best one in the fitted object's $bestTune element.
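The tuning step might look like the sketch below, again assuming iris and a kNN model, whose single tuning parameter in caret is k. The candidate values in the grid are arbitrary choices for illustration.

```r
library(caret)

set.seed(42)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]

ctrl <- trainControl(method = "cv", number = 5)

# One row per candidate; column names must match the model's tuning parameters
grid <- expand.grid(k = c(3, 5, 7, 9, 11))

knn_tuned <- train(
  Species ~ .,
  data       = train_set,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = ctrl,
  tuneGrid   = grid
)

knn_tuned$results   # resampled performance for every candidate k
knn_tuned$bestTune  # the winning value of k
```

For models with several tuning parameters, expand.grid() builds every combination, so the grid grows multiplicatively with each added parameter.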
4️⃣ Evaluating with confusionMatrix()
predict() applies the trained model to the held-out test set, and confusionMatrix() reports accuracy, sensitivity, specificity, and kappa.
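A sketch of the evaluation step under the same assumptions as before (iris data, a kNN model, illustrative object names):

```r
library(caret)

set.seed(42)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

knn_fit <- train(
  Species ~ .,
  data       = train_set,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = trainControl(method = "cv", number = 5)
)

# Predicted class labels for rows the model has never seen
pred <- predict(knn_fit, newdata = test_set)

# Cross-tabulate predictions against the true labels
cm <- confusionMatrix(data = pred, reference = test_set$Species)

cm$table                                        # the confusion matrix itself
cm$overall[c("Accuracy", "Kappa")]              # overall metrics
cm$byClass[, c("Sensitivity", "Specificity")]   # one-vs-rest metrics per class
```

With more than two classes, sensitivity and specificity are reported per class, treating each class in turn as the positive one.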
Your turn. Fill in the # TODO blank, run it, and compare with the expected output.
Train a kNN and a tree on the same split, then compare their test accuracies. Swapping methods while keeping everything else fixed is exactly what caret makes painless.
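One possible shape for this comparison is sketched below. It assumes the iris data, uses method = "rpart" for the tree (which needs the rpart package that ships with standard R installations), and introduces a hypothetical helper, test_accuracy(), that is not part of caret.

```r
library(caret)

set.seed(42)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

ctrl <- trainControl(method = "cv", number = 5)

# Same data and same resampling scheme; only the method changes
knn_fit <- train(Species ~ ., data = train_set, method = "knn",
                 preProcess = c("center", "scale"), trControl = ctrl)
tree_fit <- train(Species ~ ., data = train_set, method = "rpart",
                  trControl = ctrl)

# Hypothetical helper: share of test rows predicted correctly
test_accuracy <- function(model) {
  mean(predict(model, newdata = test_set) == test_set$Species)
}

c(knn = test_accuracy(knn_fit), tree = test_accuracy(tree_fit))
```

The tree is fitted without centring or scaling because splits on a single predictor do not depend on its scale.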
📋 Quick Reference — caret
Practice quiz
What is the main function in caret for fitting a model?
- train()
- model()
- learn()
- fit()
Answer: train(). train() is caret's unified interface for fitting and tuning hundreds of model types.
Which caret function sets resampling and control options for train()?
- resample()
- tuneControl()
- trainControl()
- controlSet()
Answer: trainControl(). trainControl() configures the resampling method, number of folds, and other control settings.
What does createDataPartition() do?
- Creates new columns
- Splits data into train/test with balanced outcome classes
- Deletes missing rows
- Centers and scales predictors
Answer: Splits data into train/test with balanced outcome classes. createDataPartition() makes a stratified train/test split so class proportions are preserved.
Which preProcess steps standardize predictors in caret?
- log and exp
- sort and filter
- merge and join
- center and scale
Answer: center and scale. preProcess = c('center', 'scale') subtracts the mean and divides by the standard deviation.
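To see what centring and scaling do to the numbers, the sketch below calls caret's standalone preProcess() function on the numeric columns of iris (an assumed dataset) and checks the result.

```r
library(caret)

predictors <- iris[, 1:4]  # numeric columns only

# Estimate each column's mean and standard deviation
pp <- preProcess(predictors, method = c("center", "scale"))

# Apply the transformation: (x - mean) / sd, column by column
standardized <- predict(pp, predictors)

round(colMeans(standardized), 10)  # every column mean is now 0
apply(standardized, 2, sd)         # every column standard deviation is now 1
```

In a real workflow the statistics should come from the training data only, which is what happens when preProcess is passed to train() instead.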
What does the tuneGrid argument to train() control?
- The candidate hyperparameter values to try
- The number of CPU cores
- The random seed only
- The plotting theme
Answer: The candidate hyperparameter values to try. tuneGrid supplies a data frame of hyperparameter combinations caret evaluates during tuning.
What does confusionMatrix() report for a classifier?
- A correlation matrix
- Accuracy, sensitivity, specificity and related metrics
- The model coefficients
- Only the accuracy
Answer: Accuracy, sensitivity, specificity and related metrics. confusionMatrix() cross-tabulates predictions vs truth and computes accuracy, sensitivity, specificity, kappa, and more.
In trainControl(method = 'cv', number = 10), what does number specify?
- The number of trees
- The number of repeats
- The number of predictors
- The number of cross-validation folds
Answer: The number of cross-validation folds. With method = 'cv', number is the count of folds, so 10 means 10-fold cross-validation.
How do you generate predictions from a trained caret model?
- output(model)
- guess(model)
- predict(model, newdata)
- forecast(model)
Answer: predict(model, newdata). predict() applies the fitted model to newdata to return class labels or probabilities.
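Both output types can be sketched as follows, assuming the iris data and a kNN model, which supports class probabilities in caret:

```r
library(caret)

set.seed(42)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

knn_fit <- train(Species ~ ., data = train_set, method = "knn",
                 preProcess = c("center", "scale"),
                 trControl = trainControl(method = "cv", number = 5))

# Default (type = "raw"): a factor of predicted class labels
labels <- predict(knn_fit, newdata = test_set)

# type = "prob": a data frame with one probability column per class
probs <- predict(knn_fit, newdata = test_set, type = "prob")

head(labels)
head(probs)
```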
Which of these is a modern alternative framework to caret?
- lubridate
- tidymodels
- ggplot2
- knitr
Answer: tidymodels. tidymodels, led by caret's author Max Kuhn, is the modern successor to caret; mlr3 is another current framework for modeling workflows in R.
Why does caret resample during training?
- To estimate out-of-sample performance and pick hyperparameters
- To remove all NA values
- To convert factors to numbers
- To shuffle column order
Answer: To estimate out-of-sample performance and pick hyperparameters. Resampling like cross-validation estimates how the model generalizes and guides hyperparameter selection.