Modeling
Machine Learning with R
R libraries used: caret, rpart, randomForest, class, e1071, stats, factoextra
By Afshine Amidi and Shervine Amidi
Overview
There are several steps that are needed to build a machine learning model:
- feature engineering: building features that can be interpreted and that can have a high predictive power
- model selection: choosing a model that can generalize well on unseen data
- tracking performance metrics: quantifying how the model performs and including the uncertainty of predictions in the results
Data preparation
Resampling The following table summarizes the main sampling techniques that are used to correct the distribution of the data, which can be necessary when classes are imbalanced:
| Action | Class acted on | Command |
|---|---|---|
| Downsample | Majority class | `downSample(x, y)` |
| Upsample with repetitions | Minority class | `upSample(x, y)` |
| Upsample with SMOTE | Minority class | `SMOTE(form, data)` |
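As an illustration, a minimal sketch of the caret resampling helpers is given below; the data frame df and its outcome column label are assumptions made for this example.
# Assumed imbalanced data frame 'df' with a factor outcome column 'label'
library(caret)
x = df[, setdiff(names(df), "label")]    # predictor columns
y = df$label                             # class labels (factor)
# Downsample the majority class so that all classes have equal counts
df_down = downSample(x, y, yname = "label")
# Upsample the minority class with repetitions
df_up = upSample(x, y, yname = "label")
# Check the new class distributions
table(df_down$label)
table(df_up$label)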
Relationship between variables The following command displays pairwise correlations as well as visual comparisons using the GGally library:
ggpairs(data)
In the resulting matrix plot, the representation at row $i$ and column $j$:
- gives the correlation number between variables $i$ and $j$ when $i < j$
- plots the probability distribution of variable $i$ when $i = j$
- plots the data in 2D where $x$- and $y$-axes are resp. variables $j$ and $i$ when $i > j$
Remark: this command is useful for variable selection decisions.
Scaling When features of a dataset are on different scales, it is sometimes recommended to standardize them. This operation can be done using the caret
library as follows:
# Create scaling based on reference dataset data_reference
scaling = preProcess(data_reference, method = c('center', 'scale'))
# Apply scaling to dataset data
data_scaled = predict(scaling, data)
Data splits The data can be randomly split into proportions $p$ and $1-p$ resp. for training and validation sets using the caret
library.

One can do so with the following commands:
# Create partition
train_partition = createDataPartition(y, p = p, list = FALSE)
# Split data into training and validation sets
data_train = data[train_partition,]
data_val = data[-train_partition,]
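For instance, a minimal sketch of an 80/20 split on the built-in iris dataset (chosen here only for illustration) looks as follows:
# 80/20 train/validation split of the built-in iris dataset
library(caret)
set.seed(42)
train_partition = createDataPartition(iris$Species, p = 0.8, list = FALSE)
data_train = iris[train_partition, ]
data_val = iris[-train_partition, ]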
Supervised models
In this part, we want to learn the relationship between the data and its labels, either through regression (continuous output) or classification (categorical output).
General methods
Linear regression Linear regression is a supervised learning method used for regression problems.

Given a data frame data containing the independent variables x and the dependent variable y, along with a formula of the type y ~ x, we can build a linear regression model as follows:
model = lm(formula, data)
summary(model)
and the resulting model can make predictions on a set of data newdata with the following command:
predictions = predict(model, newdata, params)
where the parameters are summarized in the table below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Standard errors | `se.fit` | Boolean indicating whether standard errors are displayed | `FALSE` |
| Confidence interval | `interval` | Type of confidence interval; possible values are `'none'`, `'confidence'`, `'prediction'` | `'none'` |
| Confidence level | `level` | Numeric value between 0 and 1 | `0.95` |
Remark: one should check that predictors are not multicollinear; the model coefficients can then be used to interpret the association between each predictor and the output.
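As a concrete illustration of the commands above, a linear regression can be fit on the built-in mtcars dataset; the choice of dataset and predictors here is an assumption made for the example.
# Fit a linear regression of fuel efficiency on weight and horsepower (mtcars)
model = lm(mpg ~ wt + hp, data = mtcars)
summary(model)
# Predict with 95% prediction intervals on hypothetical new cars
new_cars = data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
predictions = predict(model, new_cars, interval = 'prediction', level = 0.95)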
Logistic regression Logistic regression is a supervised learning method for classification problems.
Given a data frame data containing the independent variables x and the dependent variable y, along with a formula of the type y ~ x, we can build a logistic regression model as follows:
model = glm(formula, family = 'binomial', data = data)
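A minimal sketch of fitting and predicting is given below; the data frame data with a binary outcome column y is an assumption made for this example, and the 0.5 cutoff is an arbitrary choice.
# Assumed data frame 'data' with a binary outcome column 'y'
model = glm(y ~ ., family = 'binomial', data = data)
# Predicted probabilities (type = 'response' returns P(y = 1))
probabilities = predict(model, newdata = data, type = 'response')
# Convert probabilities into class predictions with a 0.5 cutoff
predictions = ifelse(probabilities > 0.5, 1, 0)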
k-NN The k-nearest neighbors algorithm, also known as k-NN, is a supervised learning method used here for classification.

Given training features train along with labels cl as well as test features test, we can compute predictions using the class library with the following command:
predictions = knn(train, test, cl, params)
where the parameters are summarized in the table below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Number of neighbors | `k` | Typically set to `3` | `1` |
| Probabilistic | `prob` | Boolean indicating whether probabilistic predictions are displayed | `FALSE` |
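A minimal sketch on the built-in iris dataset is shown below; the dataset and the value k = 3 are assumptions made for the example.
# k-NN classification of iris species with k = 3
library(class)
library(caret)
set.seed(42)
idx = createDataPartition(iris$Species, p = 0.8, list = FALSE)
train = iris[idx, 1:4]
test = iris[-idx, 1:4]
cl = iris$Species[idx]
# Predictions on the held-out observations, with vote proportions attached
predictions = knn(train, test, cl, k = 3, prob = TRUE)
mean(predictions == iris$Species[-idx])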
SVM The support vector machine (SVM) is an algorithm that can be used both for regression (SVR) and classification (SVC).

The following command creates a model using the e1071
library:
model = svm(formula, data, params)
| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | `type` | `'eps-regression'` for regression, `'C-classification'` for classification | |
| General | Probability | `probability` | Boolean that allows probability predictions (classification) | `FALSE` |
| General | Kernel | `kernel` | `'linear'`, `'polynomial'`, `'radial'`, `'sigmoid'` | `'radial'` |
| General | Cost | `cost` | Penalization of constraints violation | `1` |
| Kernel-specific | Degree | `degree` | Required for a polynomial kernel | `3` |
| Kernel-specific | $\gamma$ | `gamma` | Required for polynomial, radial and sigmoid kernels | `1/ncol(x)` |
| Kernel-specific | Intercept | `coef0` | Required for polynomial and sigmoid kernels | `0` |
We can make predictions with:
predictions = predict(model, newdata, params)
where the parameters are summarized in the table below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Decision values | `decision.values` | Boolean enabling intermediary computations to be returned in the multi-class classification setting | `FALSE` |
| Probabilistic predictions | `probability` | Boolean enabling probability predictions | `FALSE` |
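A minimal sketch on the built-in iris dataset is given below; the dataset and the hyperparameter values are assumptions made for the example.
# SVM classifier with a radial kernel on iris
library(e1071)
model = svm(Species ~ ., data = iris,
            type = 'C-classification', kernel = 'radial',
            cost = 1, probability = TRUE)
# Class predictions with attached class probabilities
predictions = predict(model, iris, probability = TRUE)
head(attr(predictions, 'probabilities'))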
Decision tree Decision trees are one type of tree-based models that can be built for either classification or regression purposes.
A decision tree can be built using the rpart library as follows:
model = rpart(formula, data, params)
| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | `method` | `'anova'` for regression, `'class'` for classification | |
| Parameters | Sample split | `minsplit` | Minimum number of observations in a node before a split is attempted | `20` |
| Parameters | Sample bucket | `minbucket` | Minimum number of observations in a terminal leaf | `minsplit/3` |
| Parameters | Complexity parameter | `cp` | Minimum improvement in relative error required to allow an additional split, with small values leading to overfitting | `0.01` |
| Parameters | Maximum depth | `maxdepth` | Maximum depth allowed in the final tree | `30` |
The resulting tree can be interpreted using commands from the rpart.plot library, which are summarized in the table below:
| Action | Command |
|---|---|
| Draw tree structure | `rpart.plot(model)` |
| Retrieve tree rules | `rpart.rules(model)` |
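A minimal sketch on the built-in iris dataset is given below; the dataset and the parameter values are assumptions made for the example.
# Classification tree on iris
library(rpart)
library(rpart.plot)
model = rpart(Species ~ ., data = iris,
              method = 'class', minsplit = 20, cp = 0.01)
rpart.plot(model)    # draw the tree structure
rpart.rules(model)   # print the splitting rules
# Class predictions on the training data
predictions = predict(model, iris, type = 'class')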
Ensemble models
Random forest Random forest is a tree-based ensemble model.
The following command uses the randomForest
library:
model = randomForest(formula, data, params)
| Parameter | Command | Description | Default |
|---|---|---|---|
| Number of trees | `ntree` | Number of independent trees of the random forest | `500` |
| Minimum leaf size | `nodesize` | Minimum number of observations that can be in a tree leaf, where the higher the value, the smaller the tree | `5` (regression), `1` (classification) |
| Number of sampled predictors | `mtry` | Number of predictors considered at each split of each independent tree | `ncol(x)/3` (regression), `sqrt(ncol(x))` (classification) |
| Sampling replacement | `replace` | Boolean value that indicates whether sampling of cases is done with replacement | `TRUE` |
| Class weight | `classwt` | Class priors, only used in classification | `NULL` |
We can make predictions as follows:
predictions = predict(model, newdata, params)
where the parameters are summarized in the table below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Output type | `type` | Output to be predicted values (`'response'`), matrix of class probabilities (`'prob'`) or matrix of vote counts (`'votes'`) | `'response'` |
| Predictions | `predict.all` | Boolean indicating whether all tree predictions are kept in memory | `FALSE` |
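A minimal sketch on the built-in iris dataset is given below; the dataset and the parameter values are assumptions made for the example.
# Random forest classifier on iris
library(randomForest)
set.seed(42)
model = randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
# Class predictions and class probabilities
predictions = predict(model, iris, type = 'response')
probabilities = predict(model, iris, type = 'prob')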
XGBoost Gradient boosting models can be trained using the xgboost library. First, the data needs to be converted to the DMatrix format, after which the model can be trained with the following commands:
data_xgb = xgb.DMatrix(data, label = label)
model = xgb.train(params, data_xgb, nrounds)
| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | `objective` | `'reg:linear'` for regression; `'binary:logistic'`, `'multi:softmax'`, `'multi:softprob'` for classification | `'reg:linear'` |
| General | Type of booster | `booster` | `'gbtree'` for tree-based boosters, `'gblinear'` for linear-based boosters | `'gbtree'` |
| General | Number of iterations | `nrounds` | Typically set to `50` | |
| Tree-based | Depth of trees | `max.depth` | Maximum depth of trees | `6` |
| Tree-based | Step size | `eta` | Step size of each boosting step | `0.3` |
| Linear | L1 regularization | `alpha` | LASSO coefficient | `0` |
| Linear | L2 regularization | `lambda` | Ridge coefficient | `0` |
| Linear | L2 regularization on bias | `lambda_bias` | Ridge coefficient applied to the bias term | `0` |
| Performance | Evaluation set | `watchlist` | List of `DMatrix` used for metrics generation | |
| Performance | Metric | `eval_metric` | `'rmse'`, `'mae'` for regression; `'error'`, `'logloss'`, `'mlogloss'`, `'auc'` for classification | |
Predictions are made as follows:
predict(model, newdata, params)
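A minimal sketch on the built-in iris dataset is given below; the dataset, the parameter values and the choice of the 'multi:softmax' objective are assumptions made for the example.
# Multi-class gradient boosting on iris
library(xgboost)
# xgboost expects a numeric matrix and, for 'multi:softmax', labels in 0..(num_class - 1)
x = as.matrix(iris[, 1:4])
label = as.integer(iris$Species) - 1
data_xgb = xgb.DMatrix(x, label = label)
params = list(objective = 'multi:softmax', num_class = 3,
              booster = 'gbtree', max_depth = 6, eta = 0.3,
              eval_metric = 'mlogloss')
model = xgb.train(params, data_xgb, nrounds = 50,
                  watchlist = list(train = data_xgb))
predictions = predict(model, data_xgb)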
Model performance
Cross-validation Cross-validation is a technique used to find the optimal hyperparameters of a model by making sure that it does not overfit the training set. It divides the dataset into $k$ splits, and trains the model on $k-1$ splits while making predictions on the $k^{th}$ one.

This can be achieved with the caret library, which will fit the resulting model on the training set with the cross-validated optimal hyperparameters.
First, the training setting is defined as follows:
# Set partitions
train_control = trainControl(params)
where the parameters are summarized in the table below:
| Parameter | Command | Description |
|---|---|---|
| Sampling | `method` | Examples include `'boot'`, `'cv'`, `'repeatedcv'` |
| Folds | `number` | Number of folds |
| Repeats | `repeats` | Number of times to perform cross-validation |
Then, a model can be fit using the control parameters as follows:
# Find optimal parameters with cross-validation
model = train(formula, data, trControl = train_control, method, metric, maximize, tuneGrid)
The parameters are summarized in the table below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Type of model | `method` | Examples include `'glmnet'`, `'rf'`, `'svmLinearWeights'`, `'rpart'` | `'rf'` |
| Grid search | `tuneGrid` | `expand.grid(parameters = range)` | `NULL` |
| Training control | `trControl` | `trainControl(params)` | `trainControl()` |
| Metric | `metric` | `'RMSE'`, `'R2'`, `'MAE'` for regression; `'Accuracy'`, `'ROC'`, `'F1'` for classification | `'RMSE'` (regression), `'Accuracy'` (classification) |
| Maximization | `maximize` | Boolean indicating whether to maximize the chosen metric | `FALSE` for `'RMSE'`, `'MAE'`, `'logLoss'`; `TRUE` otherwise |
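A minimal sketch on the built-in iris dataset is given below; the dataset, the 5-fold setting and the mtry grid are assumptions made for the example.
# 5-fold cross-validation of a random forest over a small mtry grid
library(caret)
set.seed(42)
train_control = trainControl(method = 'cv', number = 5)
model = train(Species ~ ., data = iris,
              method = 'rf',
              trControl = train_control,
              metric = 'Accuracy',
              tuneGrid = expand.grid(mtry = c(1, 2, 3)))
# Best hyperparameters found by cross-validation
model$bestTune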
Regression metrics The table below summarizes the main metrics used in regression problems, based on predicted values pred and actual values act, using the caret library:
| Metric | Command | Definition | Interpretation |
|---|---|---|---|
| RMSE | `RMSE(pred, act)` | $\displaystyle\sqrt{\textrm{MSE}}=\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\widehat{y_i})^2}$ | Root mean square error |
| MAE | `MAE(pred, act)` | $\displaystyle\frac{1}{n}\sum_{i=1}^n\vert y_i-\widehat{y_i}\vert$ | Mean absolute error |
| $R^2$ | `R2(pred, act, form = 'traditional')` | $\displaystyle1-\frac{\textrm{SSE}}{\textrm{SST}}$ with $\textrm{SST}=\sum_{i=1}^n(y_i-\bar{y})^2$ and $\textrm{SSE}=\sum_{i=1}^n(y_i-\widehat{y_i})^2$ | Proportion of the variance that is explained by the model |
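A minimal sketch is given below; the toy vectors act and pred are assumptions made purely for illustration.
# Toy vectors of actual and predicted values (for illustration only)
library(caret)
act = c(3.0, 2.5, 4.0, 5.5)
pred = c(2.8, 2.7, 3.6, 5.9)
RMSE(pred, act)
MAE(pred, act)
R2(pred, act, form = 'traditional')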
Classification metrics The table below summarizes the main metrics used in classification problems, which can be accessed with the caret, ModelMetrics and PRROC libraries:
| Metric | Command | Definition | Interpretation |
|---|---|---|---|
| Confusion matrix | `confusionMatrix(pred, act)` | Matrix of $\textrm{TP}$, $\textrm{FP}$, $\textrm{TN}$, $\textrm{FN}$ | Granular view of model performance |
| Accuracy | `mean(act == pred)` | $\displaystyle\frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}}$ | Overall proportion of correct predictions |
| Precision | `precision(act, pred)` | $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}$ | How accurate positive predictions are |
| Recall / Sensitivity / $\textrm{TPR}$ | `recall(act, pred)` | $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}$ | Coverage of actual positive samples |
| Specificity / $\textrm{TNR}$ | `specificity(act, pred)` | $\displaystyle\frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}}$ | Coverage of actual negative samples |
| F1 score | `f1Score(act, pred)` | $\displaystyle\frac{2\textrm{TP}}{2\textrm{TP}+\textrm{FP}+\textrm{FN}}$ | Harmonic mean of precision and recall |
| Matthews correlation coefficient | `mcc(act, pred, cutoff)` | $\frac{\textrm{TP}\times\textrm{TN}-\textrm{FP}\times\textrm{FN}}{\sqrt{(\textrm{TP}+\textrm{FP})(\textrm{TP}+\textrm{FN})(\textrm{TN}+\textrm{FP})(\textrm{TN}+\textrm{FN})}}$ | Hybrid metric between -1 and 1, useful for unbalanced classes |
| Receiver operating characteristic (ROC) curve | `obj = roc.curve(scores.class0 = pred, weights.class0 = act, curve = TRUE)`; `plot(obj)` | Plot of $\textrm{TPR}$ wrt $\textrm{FPR}$ for different cutoff values | The identity line corresponds to a random-guess classifier |
| AUC | `auc(act, pred)` | Area under the curve that plots $\textrm{TPR}$ wrt $\textrm{FPR}$ | Value between 0 and 1, where 0.5 is equivalent to a classifier that makes random predictions |
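A minimal sketch of the caret-based commands is given below; the toy vectors act and pred are assumptions made purely for illustration.
# Toy binary actuals and predicted classes (for illustration only)
library(caret)
act  = factor(c(1, 0, 1, 1, 0, 1), levels = c(0, 1))
pred = factor(c(1, 0, 0, 1, 0, 1), levels = c(0, 1))
# Confusion matrix with derived metrics, taking '1' as the positive class
confusionMatrix(pred, act, positive = '1')
# Accuracy
mean(act == pred)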
Unsupervised models
In this part, we do not have true labels, and we want to get insights from the data.
k-means The $k$-means algorithm can be performed using the base stats
library.

By taking as input a matrix of features x
as well as additional parameters, we have:
model = kmeans(x, params)
| Parameter | Command | Description | Default |
|---|---|---|---|
| Clusters | `centers` | Number of clusters | |
| Iterations | `iter.max` | Maximum number of iterations until final clusters | `10` |
| Initialization | `nstart` | Number of initial configurations that are tried | `1` |
The attributes of the fitted model can then be accessed as summarized below:
| Attribute | Command | Description |
|---|---|---|
| Clusters | `cluster` | Cluster assigned to each observation |
| Centers | `centers` | Center of each cluster |
| Size | `size` | Number of observations in each cluster |
The elbow method along with the distortion metric is used to select the number of clusters that make the most sense.
Remark: for reproducibility purposes, it is important to set the seed to a fixed value with the set.seed() command.
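A minimal sketch on the numeric columns of the built-in iris dataset is given below; the dataset and the choice of 3 clusters are assumptions made for the example.
# k-means with 3 clusters on the numeric columns of iris
set.seed(42)  # fix the seed for reproducibility
x = iris[, 1:4]
model = kmeans(x, centers = 3, iter.max = 10, nstart = 25)
model$cluster   # cluster assigned to each observation
model$centers   # coordinates of the cluster centers
model$size      # number of observations per cluster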
Hierarchical clustering The hierarchical clustering algorithm is an unsupervised technique that forms clusters of observations that look alike based on their distances. It can be performed using the base stats library:
# Dissimilarity matrix
df_distance = dist(data, method = "euclidean")
# Hierarchical clustering using complete linkage
hc = hclust(df_distance, method = "complete")
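Cluster assignments can then be retrieved by cutting the resulting dendrogram at a chosen number of clusters; the choice of 3 clusters below is an assumption made for the example.
# Cut the dendrogram into 3 clusters and count observations per cluster
clusters = cutree(hc, k = 3)
table(clusters)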
PCA Principal components analysis, commonly known as PCA, is a dimension reduction technique that can be used as a data preparation step to reduce the number of variables.

It is done using the prcomp
function on a matrix x
as follows:
model = prcomp(x, params)
where the parameters are summarized below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Centering | `center` | Boolean that indicates whether input should be centered around zero | `TRUE` |
| Scaling | `scale.` | Boolean that indicates whether input should be scaled | `FALSE` |
A scree plot is commonly used to check how much of the variance is explained by the newly-created principal components. The table below summarizes the main actions that can be done using the factoextra
library:
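For instance, a scree plot for a fitted PCA can be drawn as follows; the use of the numeric columns of iris and of the fviz_eig function is an assumption made for this minimal sketch.
# PCA on the numeric columns of iris, followed by a scree plot
library(factoextra)
model = prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(model)     # proportion of variance explained by each component
fviz_eig(model)    # scree plot of the explained variance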