Modeling


Machine Learning with R

R caret rpart randomForest class e1071 stats factoextra

By Afshine Amidi and Shervine Amidi

Overview

There are several steps needed to build a machine learning model, from data preparation to model training and performance evaluation:


Data preparation

Resampling The following table summarizes the main sampling techniques that are used to correct the distribution of the data, which can be necessary when classes are imbalanced:

Action Command
No action (initial situation) (x, y)
Downsample the majority class downSample(x, y)
Upsample the minority class with repetitions upSample(x, y)
Upsample the minority class with SMOTE SMOTE(form, data)
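
As an illustration, the sketch below balances a hypothetical imbalanced data frame using caret's downSample and upSample; the column names and class proportions are arbitrary assumptions.

# Hypothetical imbalanced data: features x1, x2 and a factor label y
library(caret)
set.seed(42)
df = data.frame(x1 = rnorm(1000), x2 = rnorm(1000),
                y = factor(c(rep('no', 950), rep('yes', 50))))

# Downsample the majority class so that both classes have the same size
df_down = downSample(x = df[, c('x1', 'x2')], y = df$y, yname = 'y')
table(df_down$y)

# Upsample the minority class with repetitions
df_up = upSample(x = df[, c('x1', 'x2')], y = df$y, yname = 'y')
table(df_up$y)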

Relationship between variables The following command from the GGally library displays pairwise correlations as well as visual comparisons between variables:

ggpairs(data)

In the generated matrix plot, the representation at row $i$ and column $j$:

  • shows the distribution of the variable when $i = j$
  • shows a scatterplot of the two variables when $i > j$
  • displays their correlation when $i < j$

Remark: this command is useful for variable selection decisions.


Scaling When features of a dataset are on different scales, it is sometimes recommended to standardize them. This operation can be done using the caret library as follows:

# Create scaling based on reference dataset data_reference
scaling = preProcess(data_reference, method = c('center', 'scale'))

# Apply scaling to dataset data
data_scaled = predict(scaling, data)

Data splits The data can be randomly split into proportions $p$ and $1-p$, respectively for the training and validation sets, using the caret library.

Partition of the dataset

One can do so with the following commands:

# Create partition
train_partition = createDataPartition(y, p = p, list = FALSE)

# Split data into training and validation sets
data_train = data[train_partition,]
data_val = data[-train_partition,]

Supervised models

In this part, we want to learn the relationship between the data and its labels. The task can either be a regression (continuous output) or a classification (multi-class output).

General methods

Linear regression Linear regression is a supervised learning method used for regression problems.

Linear regression

Given a data frame data containing the independent variables x and the dependent variable y, along with a formula of the type y ~ x, we can build a linear regression model as follows:

model = lm(formula, data)
An overview of the model can be seen with summary(model), and predictions can be made on a set of data newdata with the following command:

predictions = predict(object, newdata, params)

where the parameters are summarized in the table below:

Parameter Command Description Default
Standard errors se.fit Boolean indicating whether standard errors are displayed FALSE
Confidence interval interval Type of confidence interval. Possible values are 'none', 'confidence', 'prediction' 'none'
Confidence level level Numeric value between 0 and 1 0.95

Remark: one should check that predictors are not multicollinear; when they are not, the model coefficients can be used to interpret the association between predictors and the output.
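
For instance, a minimal sketch on the built-in mtcars dataset, where the choice of modeling mpg as a function of weight and horsepower is an arbitrary assumption:

# Fit a linear regression of mpg on weight (wt) and horsepower (hp)
model = lm(mpg ~ wt + hp, data = mtcars)
summary(model)

# Predictions with a 95% prediction interval on new observations
newdata = data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
predictions = predict(model, newdata, interval = 'prediction', level = 0.95)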


Logistic regression Logistic regression is a supervised learning method for classification problems.

Given a data frame data containing the independent variables x and the dependent variable y, along with a formula of the type y ~ x, we can build a logistic regression model as follows:

model = glm(formula, data, family = 'binomial')
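
For instance, a minimal sketch on the built-in mtcars dataset, where treating the binary am column as the outcome is an arbitrary assumption:

# Fit a logistic regression of transmission type (am) on weight and horsepower
model = glm(am ~ wt + hp, data = mtcars, family = 'binomial')
summary(model)

# Predicted probabilities, then class predictions with a 0.5 cutoff
probabilities = predict(model, mtcars, type = 'response')
predictions = as.numeric(probabilities > 0.5)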

k-NN The k-nearest neighbors algorithm, also known as k-NN, is a supervised method that predicts the label of an observation based on those of its nearest training points.

k nearest neighbors

Given training features train along with labels cl as well as test features test, we can compute predictions using the class library with the following command:

predictions = knn(train, test, cl, params)
where the hyperparameters customizable by params are presented below:

Parameter Command Description Default
Number of neighbors k Typically set to 3 1
Probabilistic prob Boolean indicating whether the proportion of votes for the winning class is returned FALSE
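
As an illustration, the sketch below classifies the built-in iris dataset with the class library; the split proportion and the value of k are arbitrary assumptions.

library(class)

# Split iris into training and test sets
set.seed(42)
train_idx = sample(nrow(iris), 0.7 * nrow(iris))
train = iris[train_idx, 1:4]
test = iris[-train_idx, 1:4]
cl = iris$Species[train_idx]

# Predict test labels with k = 3 neighbors, keeping vote proportions
predictions = knn(train, test, cl, k = 3, prob = TRUE)

# Accuracy on the test set
mean(predictions == iris$Species[-train_idx])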

SVM Support vector machines (SVM) is an algorithm that can be used both for regression (SVR) and classification (SVC).

SVM

The following command creates a model using the e1071 library:

model = svm(formula, data, params)
where the parameters are summarized below:

Category Parameter Command Description Default
General Type of model type 'eps-regression' for regression
'C-classification' for classification
Probability probability Boolean that allows probability predictions (classification) FALSE
Kernel kernel 'linear', 'polynomial', 'radial', 'sigmoid' 'radial'
Cost cost Penalization of constraint violations 1
Kernel specific Degree degree Required for a polynomial kernel 3
$\gamma$ gamma Required for polynomial, radial and sigmoid kernels 1/ncol(x)
Intercept coef0 Required for polynomial and sigmoid kernels 0

We can make predictions with:

predictions = predict(object, newdata, params)

where the parameters are summarized in the table below:

Parameter Command Description Default
Decision values decision.values Boolean enabling intermediary computations to be returned in the multi-class classification setting FALSE
Probabilistic predictions probability Boolean enabling probability predictions FALSE
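
A minimal sketch on the built-in iris dataset, where the radial kernel and cost value are arbitrary assumptions:

library(e1071)

# Fit an SVM classifier with a radial kernel and probability estimates
model = svm(Species ~ ., data = iris, type = 'C-classification',
            kernel = 'radial', cost = 1, probability = TRUE)

# Class predictions together with class probabilities
predictions = predict(model, iris, probability = TRUE)
head(attr(predictions, 'probabilities'))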

Decision tree Decision trees are one type of tree-based models that can be built for either classification or regression purposes.

A tree can be built using the rpart library as follows:

model = rpart(formula, data, params)
The parameters are summarized in the table below:

Category Parameter Command Description Default
General Type of model method 'anova' for regression
'class' for classification
Parameters Sample split minsplit Minimum number of observations in node before split is attempted 20
Sample bucket minbucket Minimum number of observations in terminal leaf minsplit/3
Complexity parameter cp Minimum decrease in relative error required for an additional split to be attempted, with small values leading to larger trees and potential overfitting 0.01
Maximum depth maxdepth Maximum depth allowed in the final tree 30

The resulting tree can be interpreted using commands from the rpart.plot library, which are summarized in the table below:

Action Command
Draw tree structure rpart.plot(model)
Retrieve tree rules rpart.rules(model)
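
For instance, a minimal sketch fitting a classification tree on the built-in iris dataset, where the hyperparameter values are arbitrary assumptions:

library(rpart)
library(rpart.plot)

# Fit a classification tree on iris
model = rpart(Species ~ ., data = iris, method = 'class',
              minsplit = 20, cp = 0.02)

# Visualize the tree and extract its decision rules
rpart.plot(model)
rpart.rules(model)

# Class predictions on the training data
predictions = predict(model, iris, type = 'class')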

Ensemble models

Random forest Random forest is a tree-based ensemble model.

The following command builds a model using the randomForest library:

model = randomForest(formula, data, params)
where the parameters are summarized below:

Parameter Command Description Default
Number of trees ntree Number of independent trees of the random forest 500
Minimum leaf size nodesize Minimum number of observations that can be in a tree leaf, where the higher the value, the smaller the tree 5 for regression
1 for classification
Number of sampled predictors mtry Number of predictors considered for each independent tree ncol(x)/3 for regression
sqrt(ncol(x)) for classification
Sampling replacement replace Boolean value that indicates whether sampling of cases is done with replacement TRUE
Class weight classwt Priors of the classes, only used in classification NULL

We can make predictions as follows:

predictions = predict(model, newdata, params)

where the parameters are summarized in the table below:

Parameter Command Description Default
Output type type Choose output type to be predicted values ('response'), matrix of class probabilities ('prob') or matrix of vote counts ('votes') 'response'
Predictions predict.all Boolean indicating whether the predictions of all individual trees are returned FALSE
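
A minimal sketch on the built-in iris dataset, where the hyperparameter values are arbitrary assumptions:

library(randomForest)

# Fit a random forest classifier with 500 trees
set.seed(42)
model = randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

# Out-of-bag error estimate and variable importance
print(model)
importance(model)

# Class probabilities on the training data
probabilities = predict(model, iris, type = 'prob')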

XGBoost Gradient boosting models can be trained using the xgboost library.

First, the data needs to be in DMatrix format, which can be done with the following command:

data_xgb = xgb.DMatrix(data, label)

The model can then be trained as follows:

model = xgb.train(params, data_xgb, nrounds)
where the parameters are summarized below:

Category Parameter Command Description Default
General Type of model objective 'reg:linear' for regression 'reg:linear'
'binary:logistic', 'multi:softmax', 'multi:softprob' for classification
Type of booster booster 'gbtree' for tree-based boosters, 'gblinear' for linear-based boosters 'gbtree'
Number of iterations nrounds Typically set to 50
Tree-based Depth of trees max.depth Maximum depth of trees 6
Step size eta Step size of each boosting step 0.3
Linear L1 regularization alpha LASSO coefficient 0
L2 regularization lambda Ridge coefficient 0
L2 regularization on bias lambda_bias Ridge coefficient applied to the bias term 0
Performance Evaluation set watchlist List of DMatrix used for metrics generation
Metric eval_metric 'rmse', 'mae' for regression
'error', 'logloss', 'mlogloss', 'auc' for classification

Predictions are made as follows:

predict(object, newdata, params)
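
A minimal sketch of a binary classifier on the built-in mtcars dataset, where the feature and hyperparameter choices are arbitrary assumptions:

library(xgboost)

# Build the DMatrix from a numeric feature matrix and a 0/1 label
x = as.matrix(mtcars[, c('wt', 'hp', 'disp')])
y = mtcars$am
data_xgb = xgb.DMatrix(data = x, label = y)

# Train a tree-based booster for binary classification
params = list(objective = 'binary:logistic', booster = 'gbtree',
              max_depth = 3, eta = 0.3, eval_metric = 'logloss')
model = xgb.train(params = params, data = data_xgb, nrounds = 50)

# Predicted probabilities on the training features
predictions = predict(model, data_xgb)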

Model performance

Cross-validation Cross-validation is a technique used to find the optimal hyperparameters of a model by making sure that it does not overfit the training set. It divides the dataset into $k$ folds, trains the model on $k-1$ of them and makes predictions on the remaining one, repeating the process for each fold.

Cross-validation

This can be achieved with the caret library, which then fits the final model on the training set with the cross-validated optimal hyperparameters.

First, the training setting is defined as follows:

# Set partitions
train_control = trainControl(params)

where the parameters are summarized in the table below:

Parameter Command Description
Sampling method method Examples include 'boot', 'cv', 'repeatedcv'
Folds number Number of folds
Repeats repeats Number of times to perform cross-validation

Then, a model can be fit using the control parameters as follows:

# Find optimal parameters with cross-validation
model = train(formula, data, method, trControl = train_control, metric, maximize, tuneGrid)

The parameters are summed up in the table below:

Parameter Command Description Default
Type of model method Examples include 'glmnet', 'rf', 'svmLinearWeights', 'rpart' 'rf'
Grid search tuneGrid expand.grid(parameters = range) NULL
Training control trControl trainControl(params) trainControl()
Metric metric 'RMSE', 'R2', 'MAE' for regression 'RMSE' for regression
'Accuracy', 'ROC', 'F1' for classification 'Accuracy' for classification
Maximization maximize Boolean indicating whether to maximize the chosen metric FALSE for error metrics such as 'RMSE'
TRUE otherwise
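
Putting both steps together, a minimal sketch tuning a random forest on the built-in iris dataset with 5-fold cross-validation; the mtry grid is an arbitrary assumption.

library(caret)

# 5-fold cross-validation
train_control = trainControl(method = 'cv', number = 5)

# Tune the number of sampled predictors (mtry) of a random forest
set.seed(42)
model = train(Species ~ ., data = iris, method = 'rf',
              trControl = train_control, metric = 'Accuracy',
              tuneGrid = expand.grid(mtry = 1:4))

# Cross-validated results and best hyperparameters
model$results
model$bestTune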

Regression metrics The table below summarizes the main metrics used in regression problems, based on predicted values pred and actual values act, using the caret library:

Metric Command Definition Interpretation
RMSE RMSE(pred, act) $\displaystyle\sqrt{\textrm{MSE}}=\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\widehat{y_i})^2}$ Root mean square error
MAE MAE(pred, act) $\displaystyle\frac{1}{n}\sum_{i=1}^n|y_i-\widehat{y_i}|$ Mean absolute error
$R^2$ R2(pred, act, formula = 'traditional') $\displaystyle1-\frac{\textrm{SSE}}{\textrm{SST}}$ with sum of squares total ($\textrm{SST}$) and errors ($\textrm{SSE}$) defined as:
  • $\displaystyle\textrm{SST}=\sum_{i=1}^n(y_i-\overline{y})^2$
  • $\displaystyle\textrm{SSE}=\sum_{i=1}^n(y_i-\widehat{y_i})^2$
Proportion of the variance that is explained by the model
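
For instance, assuming small numeric vectors of actual values act and predictions pred:

library(caret)

# Toy example: actual values and model predictions
act = c(3.0, 2.5, 4.0, 5.5)
pred = c(2.8, 2.7, 4.4, 5.0)

RMSE(pred, act)                          # root mean square error
MAE(pred, act)                           # mean absolute error
R2(pred, act, formula = 'traditional')   # proportion of variance explained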

Classification metrics The table below summarizes the main metrics used in classification problems, which can be accessed with the caret, ModelMetrics and PRROC libraries:

Metric Command Definition Interpretation
Confusion matrix confusionMatrix(act, pred) Matrix of $\textrm{TP}$, $\textrm{FP}$, $\textrm{TN}$, $\textrm{FN}$ Granular view of model performance
Accuracy mean(act == pred) $\displaystyle\frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}}$ Overall performance of the model
Precision precision(act, pred) $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}$ How accurate positive predictions are
Recall / Sensitivity / $\textrm{TPR}$ recall(act, pred) $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}$ Coverage of actual positive samples
Specificity / $\textrm{TNR}$ specificity(act, pred) $\displaystyle\frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}}$ Coverage of actual negative samples
F1 score f1Score(act, pred) $\displaystyle\frac{2\textrm{TP}}{2\textrm{TP}+\textrm{FP}+\textrm{FN}}$ Harmonic mean of precision and recall
Matthews correlation coefficient mcc(act, pred, cutoff) $\frac{\textrm{TP}\times\textrm{TN}-\textrm{FP}\times\textrm{FN}}{\sqrt{(\textrm{TP}+\textrm{FP})(\textrm{TP}+\textrm{FN})(\textrm{TN}+\textrm{FP})(\textrm{TN}+\textrm{FN})}}$ Hybrid metric between -1 and 1, useful for imbalanced classes
Receiver operating characteristic (ROC) curve obj = roc.curve(pred, act)
plot(obj)
Plot of $\textrm{TPR}$ with respect to $\textrm{FPR}$ for different cutoff values, where the identity line corresponds to a random guess classifier
AUC auc(act, pred) Area under the curve that plots $\textrm{TPR}$ with respect to $\textrm{FPR}$ Value between 0 and 1, where 0.5 corresponds to a classifier making random predictions
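
For instance, assuming toy 0/1 vectors of actual labels act and predicted probabilities prob:

library(ModelMetrics)

# Toy example: actual 0/1 labels and predicted probabilities
act = c(0, 0, 1, 1, 1, 0)
prob = c(0.2, 0.6, 0.8, 0.4, 0.9, 0.1)
pred = as.numeric(prob > 0.5)

mean(act == pred)        # accuracy
precision(act, pred)     # how accurate positive predictions are
recall(act, pred)        # coverage of actual positive samples
f1Score(act, pred)       # harmonic mean of precision and recall
auc(act, prob)           # area under the ROC curve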

Unsupervised models

In this part, we do not have true labels, and we want to get insights from the data.

k-means The $k$-means algorithm can be performed using the base stats library.

k-means

By taking as input a matrix of features x as well as additional parameters, we have:

model = kmeans(x, params)
where the parameters are summarized below:

Parameter Command Description Default
Clusters centers Number of clusters
Iterations iter.max Maximum number of iterations until final clusters 10
Initialization nstart Number of initial configurations that are tried 1

The attributes of the model can then be accessed with the dimensions summarized below:

Dimension Command Description
Clusters cluster Cluster assigned to each observation
Centers centers Center of each cluster
Sizes size Number of observations in each cluster

The elbow method, along with the distortion metric, is commonly used to select the number of clusters that makes the most sense.

Remark: for reproducibility purposes, it is important to set the seed to a fixed value with the set.seed() command.
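
A minimal sketch on the numeric columns of the built-in iris dataset, where the choice of 3 clusters is an arbitrary assumption:

# Cluster the numeric columns of iris into 3 groups
set.seed(42)
model = kmeans(iris[, 1:4], centers = 3, iter.max = 10, nstart = 25)

# Cluster assignments, centers and sizes
model$cluster
model$centers
model$size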


Hierarchical clustering The hierarchical clustering algorithm is an unsupervised technique that forms clusters of observations that look alike based on their distances. It can be performed using the base stats library:

# Dissimilarity matrix
df_distance = dist(data, method = "euclidean")

# Hierarchical clustering using complete linkage
hc = hclust(df_distance, method = "complete")
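
The resulting object can then be visualized as a dendrogram and cut into a chosen number of clusters (here 3, as an arbitrary example) using base R:

# Plot the dendrogram
plot(hc)

# Cut the tree into 3 clusters and count observations per cluster
clusters = cutree(hc, k = 3)
table(clusters)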

PCA Principal component analysis, commonly known as PCA, is a dimension reduction technique that can be used as a data preparation step to reduce the number of variables.

PCA

It is done using the prcomp function on a matrix x as follows:

model = prcomp(x, params)

where the parameters are summarized below:

Parameter Command Description Default
Centering center Boolean that indicates whether input should be centered around zero TRUE
Scaling scale Boolean that indicates whether input should be scaled FALSE
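
A minimal sketch on the numeric columns of the built-in iris dataset:

# Run PCA on the numeric columns of iris, centering and scaling the features
model = prcomp(iris[, 1:4], center = TRUE, scale = TRUE)

# Proportion of variance explained by each principal component
summary(model)

# Coordinates of the observations in the principal component space
head(model$x)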

A scree plot is commonly used to check how much of the variance is explained by the newly-created principal components. The table below summarizes the main actions that can be done using the factoextra library: