Modeling
Machine Learning with R
R libraries used: caret, rpart, randomForest, class, e1071, stats, factoextra
By Afshine Amidi and Shervine Amidi
Overview
There are several steps that are needed to build a machine learning model:
- feature engineering: building features that can be interpreted and that can have a high predictive power
- model selection: choosing a model that can generalize well on unseen data
- tracking performance metrics: quantifying how the model performs and including the uncertainty of predictions in the results
Data preparation
Resampling The following table summarizes the main sampling techniques that are used to correct the distribution of the data, which can be necessary when classes are imbalanced:
| Action | Class acted on | Command |
|---|---|---|
| Downsample | Majority class | `downSample(x, y)` |
| Upsample with repetitions | Minority class | `upSample(x, y)` |
| Upsample with SMOTE | Minority class | `SMOTE(form, data)` |
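As an illustration, a minimal sketch of the caret resampling helpers is given below; the data frame df and its outcome column label are assumptions made for this example.
# Assumed imbalanced data frame 'df' with a factor outcome column 'label'
library(caret)
x = df[, setdiff(names(df), "label")]    # predictor columns
y = df$label                             # class labels (factor)
# Downsample the majority class so that all classes have equal counts
df_down = downSample(x, y, yname = "label")
# Upsample the minority class with repetitions
df_up = upSample(x, y, yname = "label")
# Check the new class distributions
table(df_down$label)
table(df_up$label)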
Relationship between variables The following command displays pairwise correlations as well as visual comparisons using the GGally library:
ggpairs(data)
In the resulting matrix plot, the representation at row $i$ and column $j$:
- gives the correlation number between variables $i$ and $j$ when $i < j$
- plots the probability distribution of variable $i$ when $i = j$
- plots the data in 2D where $x$- and $y$-axes are resp. variables $j$ and $i$ when $i > j$
Remark: this command is useful for variable selection decisions.
Scaling When features of a dataset are on different scales, it is sometimes recommended to standardize them. This operation can be done using the caret
library as follows:
# Create scaling based on reference dataset data_reference
scaling = preProcess(data_reference, method = c('center', 'scale'))
# Apply scaling to dataset data
data_scaled = predict(scaling, data)
Data splits The data can be randomly split into proportions $p$ and $1-p$ resp. for training and validation sets using the caret
library.

One can do so with the following commands:
# Create partition
train_partition = createDataPartition(y, p = p, list = FALSE)
# Split data into training and validation sets
data_train = data[train_partition,]
data_val = data[-train_partition,]
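For instance, a minimal sketch of an 80/20 split on the built-in iris dataset (chosen here only for illustration) looks as follows:
# 80/20 train/validation split of the built-in iris dataset
library(caret)
set.seed(42)
train_partition = createDataPartition(iris$Species, p = 0.8, list = FALSE)
data_train = iris[train_partition, ]
data_val = iris[-train_partition, ]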
Supervised models
In this part, we want to learn the relationship between the data and its labels, either through regression (continuous output) or classification (categorical output).
General methods
Linear regression Linear regression is a supervised learning method used for regression problems.

Given a data frame data containing the independent variables x and the dependent variable y, along with a formula of the type y ~ x, we can build a linear regression model as follows:
model = lm(formula, data)
summary(model)
and the resulting model can make predictions on a set of data newdata with the following command:
predictions = predict(model, newdata, params)
where the parameters are summarized in the table below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Standard errors | `se.fit` | Boolean indicating whether standard errors are displayed | `FALSE` |
| Confidence interval | `interval` | Type of confidence interval; possible values are `'none'`, `'confidence'`, `'prediction'` | `'none'` |
| Confidence level | `level` | Numeric value between 0 and 1 | `0.95` |
Remark: one should check that predictors are not multicollinear; the model coefficients can then be used to interpret the association between each predictor and the output.
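As a concrete illustration of the commands above, a linear regression can be fit on the built-in mtcars dataset; the choice of dataset and predictors here is an assumption made for the example.
# Fit a linear regression of fuel efficiency on weight and horsepower (mtcars)
model = lm(mpg ~ wt + hp, data = mtcars)
summary(model)
# Predict with 95% prediction intervals on hypothetical new cars
new_cars = data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
predictions = predict(model, new_cars, interval = 'prediction', level = 0.95)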
Logistic regression Logistic regression is a supervised learning method for classification problems.
Given a data frame data containing the independent variables x and the dependent variable y, along with a formula of the type y ~ x, we can build a logistic regression model as follows:
model = glm(formula, family = 'binomial', data = data)
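A minimal sketch of fitting and predicting is given below; the data frame data with a binary outcome column y is an assumption made for this example, and the 0.5 cutoff is an arbitrary choice.
# Assumed data frame 'data' with a binary outcome column 'y'
model = glm(y ~ ., family = 'binomial', data = data)
# Predicted probabilities (type = 'response' returns P(y = 1))
probabilities = predict(model, newdata = data, type = 'response')
# Convert probabilities into class predictions with a 0.5 cutoff
predictions = ifelse(probabilities > 0.5, 1, 0)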
k-NN The k-nearest neighbors algorithm, also known as k-NN, is a supervised learning method used here for classification.

Given training features train along with labels cl as well as test features test, we can compute predictions using the class library with the following command:
predictions = knn(train, test, cl, params)
where the parameters are summarized in the table below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Number of neighbors | `k` | Typically set to `3` | `1` |
| Probabilistic | `prob` | Boolean indicating whether probabilistic predictions are displayed | `FALSE` |
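A minimal sketch on the built-in iris dataset is shown below; the dataset and the value k = 3 are assumptions made for the example.
# k-NN classification of iris species with k = 3
library(class)
library(caret)
set.seed(42)
idx = createDataPartition(iris$Species, p = 0.8, list = FALSE)
train = iris[idx, 1:4]
test = iris[-idx, 1:4]
cl = iris$Species[idx]
# Predictions on the held-out observations, with vote proportions attached
predictions = knn(train, test, cl, k = 3, prob = TRUE)
mean(predictions == iris$Species[-idx])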
SVM The support vector machine (SVM) is an algorithm that can be used both for regression (SVR) and classification (SVC).

The following command creates a model using the e1071
library:
model = svm(formula, data, params)
| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | `type` | `'eps-regression'` for regression, `'C-classification'` for classification | |
| General | Probability | `probability` | Boolean that allows probability predictions (classification) | `FALSE` |
| General | Kernel | `kernel` | `'linear'`, `'polynomial'`, `'radial'`, `'sigmoid'` | `'radial'` |
| General | Cost | `cost` | Penalization of constraints violation | `1` |
| Kernel-specific | Degree | `degree` | Required for a polynomial kernel | `3` |
| Kernel-specific | $\gamma$ | `gamma` | Required for polynomial, radial and sigmoid kernels | `1/ncol(x)` |
| Kernel-specific | Intercept | `coef0` | Required for polynomial and sigmoid kernels | `0` |
We can make predictions with:
predictions = predict(model, newdata, params)
where the parameters are summarized in the table below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Decision values | `decision.values` | Boolean enabling intermediary computations to be returned in the multi-class classification setting | `FALSE` |
| Probabilistic predictions | `probability` | Boolean enabling probability predictions | `FALSE` |
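A minimal sketch on the built-in iris dataset is given below; the dataset and the hyperparameter values are assumptions made for the example.
# SVM classifier with a radial kernel on iris
library(e1071)
model = svm(Species ~ ., data = iris,
            type = 'C-classification', kernel = 'radial',
            cost = 1, probability = TRUE)
# Class predictions with attached class probabilities
predictions = predict(model, iris, probability = TRUE)
head(attr(predictions, 'probabilities'))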
Decision tree Decision trees are one type of tree-based models that can be built for either classification or regression purposes.
A decision tree can be built using the rpart library as follows:
model = rpart(formula, data, params)
| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | `method` | `'anova'` for regression, `'class'` for classification | |
| Parameters | Sample split | `minsplit` | Minimum number of observations in a node before a split is attempted | `20` |
| Parameters | Sample bucket | `minbucket` | Minimum number of observations in a terminal leaf | `minsplit/3` |
| Parameters | Complexity parameter | `cp` | Minimum improvement in relative error required to allow an additional split, with small values leading to overfitting | `0.01` |
| Parameters | Maximum depth | `maxdepth` | Maximum depth allowed in the final tree | `30` |
The resulting tree can be interpreted using commands from the rpart.plot library, which are summarized in the table below:
| Action | Command |
|---|---|
| Draw tree structure | `rpart.plot(model)` |
| Retrieve tree rules | `rpart.rules(model)` |
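A minimal sketch on the built-in iris dataset is given below; the dataset and the parameter values are assumptions made for the example.
# Classification tree on iris
library(rpart)
library(rpart.plot)
model = rpart(Species ~ ., data = iris,
              method = 'class', minsplit = 20, cp = 0.01)
rpart.plot(model)    # draw the tree structure
rpart.rules(model)   # print the splitting rules
# Class predictions on the training data
predictions = predict(model, iris, type = 'class')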
Ensemble models
Random forest Random forest is a tree-based ensemble model.
The following command uses the randomForest
library:
model = randomForest(formula, data, params)
| Parameter | Command | Description | Default |
|---|---|---|---|
| Number of trees | `ntree` | Number of independent trees of the random forest | `500` |
| Minimum leaf size | `nodesize` | Minimum number of observations that can be in a tree leaf, where the higher the value, the smaller the tree | `5` (regression), `1` (classification) |
| Number of sampled predictors | `mtry` | Number of predictors considered at each split of each independent tree | `ncol(x)/3` (regression), `sqrt(ncol(x))` (classification) |
| Sampling replacement | `replace` | Boolean value that indicates whether sampling of cases is done with replacement | `TRUE` |
| Class weight | `classwt` | Class priors, only used in classification | `NULL` |
We can make predictions as follows:
predictions = predict(model, newdata, params)
where the parameters are summarized in the table below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Output type | `type` | Output to be predicted values (`'response'`), matrix of class probabilities (`'prob'`) or matrix of vote counts (`'votes'`) | `'response'` |
| Predictions | `predict.all` | Boolean indicating whether all tree predictions are kept in memory | `FALSE` |
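A minimal sketch on the built-in iris dataset is given below; the dataset and the parameter values are assumptions made for the example.
# Random forest classifier on iris
library(randomForest)
set.seed(42)
model = randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
# Class predictions and class probabilities
predictions = predict(model, iris, type = 'response')
probabilities = predict(model, iris, type = 'prob')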
XGBoost Gradient boosting models can be trained using the xgboost library. First, the data needs to be converted to the DMatrix format, after which the model can be trained with the following commands:
data_xgb = xgb.DMatrix(data, label = label)
model = xgb.train(params, data_xgb, nrounds)
| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | `objective` | `'reg:linear'` for regression; `'binary:logistic'`, `'multi:softmax'`, `'multi:softprob'` for classification | `'reg:linear'` |
| General | Type of booster | `booster` | `'gbtree'` for tree-based boosters, `'gblinear'` for linear-based boosters | `'gbtree'` |
| General | Number of iterations | `nrounds` | Typically set to `50` | |
| Tree-based | Depth of trees | `max.depth` | Maximum depth of trees | `6` |
| Tree-based | Step size | `eta` | Step size of each boosting step | `0.3` |
| Linear | L1 regularization | `alpha` | LASSO coefficient | `0` |
| Linear | L2 regularization | `lambda` | Ridge coefficient | `0` |
| Linear | L2 regularization on bias | `lambda_bias` | Ridge coefficient applied to the bias term | `0` |
| Performance | Evaluation set | `watchlist` | List of `DMatrix` used for metrics generation | |
| Performance | Metric | `eval_metric` | `'rmse'`, `'mae'` for regression; `'error'`, `'logloss'`, `'mlogloss'`, `'auc'` for classification | |
Predictions are made as follows:
predict(model, newdata, params)
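A minimal sketch on the built-in iris dataset is given below; the dataset, the parameter values and the choice of the 'multi:softmax' objective are assumptions made for the example.
# Multi-class gradient boosting on iris
library(xgboost)
# xgboost expects a numeric matrix and, for 'multi:softmax', labels in 0..(num_class - 1)
x = as.matrix(iris[, 1:4])
label = as.integer(iris$Species) - 1
data_xgb = xgb.DMatrix(x, label = label)
params = list(objective = 'multi:softmax', num_class = 3,
              booster = 'gbtree', max_depth = 6, eta = 0.3,
              eval_metric = 'mlogloss')
model = xgb.train(params, data_xgb, nrounds = 50,
                  watchlist = list(train = data_xgb))
predictions = predict(model, data_xgb)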
Model performance
Cross-validation Cross-validation is a technique used to find the optimal hyperparameters of a model by making sure that it does not overfit the training set. It divides the dataset into $k$ splits, and trains the model on $k-1$ splits while making predictions on the $k^{th}$ one.

This can be achieved with the caret library, which will fit the resulting model on the training set with the cross-validated optimal hyperparameters.
First, the training setting is defined as follows:
# Set partitions
train_control = trainControl(params)
where the parameters are summarized in the table below:
| Parameter | Command | Description |
|---|---|---|
| Sampling | `method` | Examples include `'boot'`, `'cv'`, `'repeatedcv'` |
| Folds | `number` | Number of folds |
| Repeats | `repeats` | Number of times to perform cross-validation |
Then, a model can be fit using the control parameters as follows:
# Find optimal parameters with cross-validation
model = train(formula, data, trControl = train_control, method, metric, maximize, tuneGrid)
The parameters are summarized in the table below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Type of model | `method` | Examples include `'glmnet'`, `'rf'`, `'svmLinearWeights'`, `'rpart'` | `'rf'` |
| Grid search | `tuneGrid` | `expand.grid(parameters = range)` | `NULL` |
| Training control | `trControl` | `trainControl(params)` | `trainControl()` |
| Metric | `metric` | `'RMSE'`, `'R2'`, `'MAE'` for regression; `'Accuracy'`, `'ROC'`, `'F1'` for classification | `'RMSE'` (regression), `'Accuracy'` (classification) |
| Maximization | `maximize` | Boolean indicating whether to maximize the chosen metric | `FALSE` for `'RMSE'`, `'MAE'`, `'logLoss'`; `TRUE` otherwise |
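A minimal sketch on the built-in iris dataset is given below; the dataset, the 5-fold setting and the mtry grid are assumptions made for the example.
# 5-fold cross-validation of a random forest over a small mtry grid
library(caret)
set.seed(42)
train_control = trainControl(method = 'cv', number = 5)
model = train(Species ~ ., data = iris,
              method = 'rf',
              trControl = train_control,
              metric = 'Accuracy',
              tuneGrid = expand.grid(mtry = c(1, 2, 3)))
# Best hyperparameters found by cross-validation
model$bestTune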
Regression metrics The table below summarizes the main metrics used in regression problems, based on predicted values pred and actual values act, using the caret library:
| Metric | Command | Definition | Interpretation |
|---|---|---|---|
| RMSE | `RMSE(pred, act)` | $\displaystyle\sqrt{\textrm{MSE}}=\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\widehat{y_i})^2}$ | Root mean square error |
| MAE | `MAE(pred, act)` | $\displaystyle\frac{1}{n}\sum_{i=1}^n\vert y_i-\widehat{y_i}\vert$ | Mean absolute error |
| $R^2$ | `R2(pred, act, form = 'traditional')` | $\displaystyle1-\frac{\textrm{SSE}}{\textrm{SST}}$ with $\textrm{SST}=\sum_{i=1}^n(y_i-\bar{y})^2$ and $\textrm{SSE}=\sum_{i=1}^n(y_i-\widehat{y_i})^2$ | Proportion of the variance that is explained by the model |
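A minimal sketch is given below; the toy vectors act and pred are assumptions made purely for illustration.
# Toy vectors of actual and predicted values (for illustration only)
library(caret)
act = c(3.0, 2.5, 4.0, 5.5)
pred = c(2.8, 2.7, 3.6, 5.9)
RMSE(pred, act)
MAE(pred, act)
R2(pred, act, form = 'traditional')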
Classification metrics The table below summarizes the main metrics used in classification problems, which can be accessed with the caret, ModelMetrics and PRROC libraries:
| Metric | Command | Definition | Interpretation |
|---|---|---|---|
| Confusion matrix | `confusionMatrix(pred, act)` | Matrix of $\textrm{TP}$, $\textrm{FP}$, $\textrm{TN}$, $\textrm{FN}$ | Granular view of model performance |
| Accuracy | `mean(act == pred)` | $\displaystyle\frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}}$ | Overall proportion of correct predictions |
| Precision | `precision(act, pred)` | $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}$ | How accurate positive predictions are |
| Recall / Sensitivity / $\textrm{TPR}$ | `recall(act, pred)` | $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}$ | Coverage of actual positive samples |
| Specificity / $\textrm{TNR}$ | `specificity(act, pred)` | $\displaystyle\frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}}$ | Coverage of actual negative samples |
| F1 score | `f1Score(act, pred)` | $\displaystyle\frac{2\textrm{TP}}{2\textrm{TP}+\textrm{FP}+\textrm{FN}}$ | Harmonic mean of precision and recall |
| Matthews correlation coefficient | `mcc(act, pred, cutoff)` | $\frac{\textrm{TP}\times\textrm{TN}-\textrm{FP}\times\textrm{FN}}{\sqrt{(\textrm{TP}+\textrm{FP})(\textrm{TP}+\textrm{FN})(\textrm{TN}+\textrm{FP})(\textrm{TN}+\textrm{FN})}}$ | Hybrid metric between -1 and 1, useful for unbalanced classes |
| Receiver operating characteristic (ROC) curve | `obj = roc.curve(scores.class0 = pred, weights.class0 = act, curve = TRUE)`; `plot(obj)` | Plot of $\textrm{TPR}$ wrt $\textrm{FPR}$ for different cutoff values | The identity line corresponds to a random-guess classifier |
| AUC | `auc(act, pred)` | Area under the curve that plots $\textrm{TPR}$ wrt $\textrm{FPR}$ | Value between 0 and 1, where 0.5 is equivalent to a classifier that makes random predictions |
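A minimal sketch of the caret-based commands is given below; the toy vectors act and pred are assumptions made purely for illustration.
# Toy binary actuals and predicted classes (for illustration only)
library(caret)
act  = factor(c(1, 0, 1, 1, 0, 1), levels = c(0, 1))
pred = factor(c(1, 0, 0, 1, 0, 1), levels = c(0, 1))
# Confusion matrix with derived metrics, taking '1' as the positive class
confusionMatrix(pred, act, positive = '1')
# Accuracy
mean(act == pred)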
Unsupervised models
In this part, we do not have true labels, and we want to get insights from the data.
k-means The $k$-means algorithm can be performed using the base stats
library.

By taking as input a matrix of features x
as well as additional parameters, we have:
model = kmeans(x, params)
| Parameter | Command | Description | Default |
|---|---|---|---|
| Clusters | `centers` | Number of clusters | |
| Iterations | `iter.max` | Maximum number of iterations until final clusters | `10` |
| Initialization | `nstart` | Number of initial configurations that are tried | `1` |
The attributes of the fitted model can then be accessed as summarized below:
| Attribute | Command | Description |
|---|---|---|
| Clusters | `cluster` | Cluster assigned to each observation |
| Centers | `centers` | Center of each cluster |
| Size | `size` | Number of observations in each cluster |
The elbow method along with the distortion metric is used to select the number of clusters that make the most sense.
Remark: for reproducibility purposes, it is important to set the seed to a fixed value with the set.seed() command.
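A minimal sketch on the numeric columns of the built-in iris dataset is given below; the dataset and the choice of 3 clusters are assumptions made for the example.
# k-means with 3 clusters on the numeric columns of iris
set.seed(42)  # fix the seed for reproducibility
x = iris[, 1:4]
model = kmeans(x, centers = 3, iter.max = 10, nstart = 25)
model$cluster   # cluster assigned to each observation
model$centers   # coordinates of the cluster centers
model$size      # number of observations per cluster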
Hierarchical clustering The hierarchical clustering algorithm is an unsupervised technique that forms clusters of observations that look alike based on their distances. It can be performed using the base stats library:
# Dissimilarity matrix
df_distance = dist(data, method = "euclidean")
# Hierarchical clustering using complete linkage
hc = hclust(df_distance, method = "complete")
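Cluster assignments can then be retrieved by cutting the resulting dendrogram at a chosen number of clusters; the choice of 3 clusters below is an assumption made for the example.
# Cut the dendrogram into 3 clusters and count observations per cluster
clusters = cutree(hc, k = 3)
table(clusters)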
PCA Principal components analysis, commonly known as PCA, is a dimension reduction technique that can be used as a data preparation step to reduce the number of variables.

It is done using the prcomp
function on a matrix x
as follows:
model = prcomp(x, params)
where the parameters are summarized below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Centering | `center` | Boolean that indicates whether input should be centered around zero | `TRUE` |
| Scaling | `scale.` | Boolean that indicates whether input should be scaled | `FALSE` |
A scree plot is commonly used to check how much of the variance is explained by the newly-created principal components. The table below summarizes the main actions that can be done using the factoextra
library:
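For instance, a scree plot for a fitted PCA can be drawn as follows; the use of the numeric columns of iris and of the fviz_eig function is an assumption made for this minimal sketch.
# PCA on the numeric columns of iris, followed by a scree plot
library(factoextra)
model = prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(model)     # proportion of variance explained by each component
fviz_eig(model)    # scree plot of the explained variance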