Supervised learning with R study guide

Overview

There are several steps needed to build a machine learning model: preparing the data, training the model, and evaluating its performance.


Supervised models

In this part, we want to learn the relationship between the data and its labels. The task can be either regression (continuous output) or classification (categorical output).

General methods

Linear regression Linear regression is a supervised learning method used for regression problems.


Given a data frame data containing the independent variables x and the dependent variable y, along with a formula of the type y ~ x, we can build a linear regression model as follows:

model = lm(formula, data)

An overview of the model can be obtained with summary(model), and the model object can make predictions on a set of data newdata with the following command:

predictions = predict(model, newdata, params)

where the parameters are summarized in the table below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Standard errors | se.fit | Boolean indicating whether standard errors are displayed | FALSE |
| Confidence interval | interval | Type of interval. Possible values are 'none', 'confidence', 'prediction' | 'none' |
| Confidence level | level | Numeric value between 0 and 1 | 0.95 |

Remark: check whether predictors are multicollinear before interpreting coefficients. Linear regression can be used to interpret the association between predictors and the output.
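
As an illustration, the sketch below fits a linear model on the built-in mtcars dataset; the dataset and the choice of variables are assumptions made for the example, and any data frame with a formula works the same way:

# Fit mpg as a linear function of weight and horsepower (example data: mtcars)
model = lm(mpg ~ wt + hp, data = mtcars)
summary(model)   # coefficients, standard errors, R-squared

# Predict on new observations with a 95% prediction interval
newdata = data.frame(wt = c(2.5, 3.0), hp = c(110, 150))
predictions = predict(model, newdata, interval = 'prediction', level = 0.95)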


Logistic regression Logistic regression is a supervised learning method for classification problems.


Given a data frame data containing the independent variables x and the dependent variable y, along with a formula of the type y ~ x, we can build a logistic regression model as follows:

model = glm(formula, data, family = 'binomial')
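
For illustration, the sketch below fits a logistic regression on the built-in mtcars dataset, predicting the binary variable am; the dataset and the 0.5 threshold are assumptions made for the example:

# Predict the binary variable am (transmission type) from weight and horsepower
model = glm(am ~ wt + hp, data = mtcars, family = 'binomial')
summary(model)

# type = 'response' returns probabilities, which can then be thresholded
probs = predict(model, newdata = mtcars, type = 'response')
preds = ifelse(probs > 0.5, 1, 0)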


k-NN The k-nearest neighbors algorithm, also known as k-NN, is a supervised learning method used here for classification.


Given training features train along with labels cl as well as test features test, we can compute predictions using the class library with the following command:

predictions = knn(train, test, cl, params)

where the hyperparameters customizable by params are presented below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Number of neighbors | k | Typically set to 3 | 1 |
| Probabilistic | prob | Boolean indicating whether probabilistic predictions are displayed | FALSE |
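
As an example, the sketch below applies k-NN to the built-in iris dataset with k = 3; the dataset, the train/test split and the value of k are assumptions made for the example:

library(class)

# Example: split the built-in iris dataset into training and test features
set.seed(42)
idx   = sample(nrow(iris), 100)
train = iris[idx, 1:4]            # training features
test  = iris[-idx, 1:4]           # test features
cl    = iris$Species[idx]         # training labels

predictions = knn(train, test, cl, k = 3, prob = TRUE)
mean(predictions == iris$Species[-idx])   # test accuracy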


SVM Support vector machines (SVM) are algorithms that can be used both for regression (SVR) and classification (SVC).


The following command creates a model using the e1071 library:

model = svm(formula, data, params)

where the parameters are summarized below:

| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | type | 'eps-regression' for regression, 'C-classification' for classification | |
| General | Probability | probability | Boolean that allows probability predictions (classification) | FALSE |
| General | Kernel | kernel | 'linear', 'polynomial', 'radial', 'sigmoid' | 'radial' |
| General | Cost | cost | Penalization of constraints violation | 1 |
| Kernel specific | Degree | degree | Required for a polynomial kernel | 3 |
| Kernel specific | $\gamma$ | gamma | Required for polynomial, radial and sigmoid kernels | 1/ncol(x) |
| Kernel specific | Intercept | coef0 | Required for polynomial and sigmoid kernels | 0 |

We can make predictions with:

predictions = predict(model, newdata, params)

where the parameters are summarized in the table below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Decision values | decision.values | Boolean enabling intermediary computations to be returned in the multi-class classification setting | FALSE |
| Probabilistic predictions | probability | Boolean enabling probability predictions to be returned | FALSE |
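
Putting both steps together, the sketch below trains an SVM classifier on the built-in iris dataset and retrieves class probabilities; the dataset and hyperparameter values are assumptions made for the example:

library(e1071)

# Train an SVM classifier with a radial kernel on the built-in iris dataset
model = svm(Species ~ ., data = iris,
            type = 'C-classification', kernel = 'radial',
            cost = 1, probability = TRUE)

# Class predictions and class probabilities on new data
predictions = predict(model, newdata = iris, probability = TRUE)
head(attr(predictions, 'probabilities'))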


Decision tree Decision trees are a type of tree-based model that can be built for either classification or regression purposes.

The following command builds a model using the rpart library:

model = rpart(formula, data, params)

The parameters are summarized in the table below:

| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | method | 'anova' for regression, 'class' for classification | |
| Parameters | Sample split | minsplit | Minimum number of observations in a node before a split is attempted | 20 |
| Parameters | Sample bucket | minbucket | Minimum number of observations in a terminal leaf | minsplit/3 |
| Parameters | Complexity parameter | cp | Amount of relative error improvement needed to allow an additional split, with small values leading to overfitting | 0.01 |
| Parameters | Maximum depth | maxdepth | Maximum depth allowed in the final tree | 30 |

The resulting tree can be interpreted using commands from the rpart.plot library, which are summarized in the table below:

| Action | Command |
|---|---|
| Draw tree structure | rpart.plot(model) |
| Retrieve tree rules | rpart.rules(model) |
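
For example, the sketch below grows a classification tree on the built-in iris dataset and visualizes it; the dataset and parameter values are assumptions made for the example:

library(rpart)
library(rpart.plot)

# Grow a classification tree on the built-in iris dataset
model = rpart(Species ~ ., data = iris, method = 'class',
              minsplit = 20, cp = 0.01, maxdepth = 30)

rpart.plot(model)    # draw the tree structure
rpart.rules(model)   # print the decision rules

predictions = predict(model, newdata = iris, type = 'class')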


Ensemble models

Random forest Random forest is a tree-based ensemble model.

The following command uses the randomForest library:

model = randomForest(formula, data, params)

where the parameters are summarized below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Number of trees | ntree | Number of independent trees in the random forest | 500 |
| Minimum leaf size | nodesize | Minimum number of observations in a tree leaf, where the higher the value, the smaller the trees | 5 (regression), 1 (classification) |
| Number of sampled predictors | mtry | Number of predictors randomly sampled as split candidates | ncol(x)/3 (regression), sqrt(ncol(x)) (classification) |
| Sampling replacement | replace | Boolean value indicating whether sampling of cases is done with replacement | TRUE |
| Class weight | classwt | Priors of the classes; only used in classification | NULL |


We can make predictions as follows:

predictions = predict(model, newdata, params)

where the parameters are summarized in the table below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Output type | type | Output can be predicted values ('response'), matrix of class probabilities ('prob') or matrix of vote counts ('votes') | 'response' |
| Predictions | predict.all | Boolean indicating whether all tree predictions are kept in memory | FALSE |
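
As an illustration, the sketch below trains a random forest classifier on the built-in iris dataset; the dataset and hyperparameter values are assumptions made for the example:

library(randomForest)

# Train a random forest classifier on the built-in iris dataset
set.seed(42)
model = randomForest(Species ~ ., data = iris,
                     ntree = 500, mtry = 2, nodesize = 1)

# Predicted classes and class probabilities
predictions = predict(model, newdata = iris, type = 'response')
probs       = predict(model, newdata = iris, type = 'prob')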


XGBoost Gradient boosting models can be trained using the xgboost library.

First, the data needs to be converted to the DMatrix format, which can be done with the following command:

data_xgb = xgb.DMatrix(data, label)

The model can then be trained as follows:

model = xgb.train(params, data_xgb, nrounds)

where the parameters are summarized below:

| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | objective | 'reg:linear' for regression; 'binary:logistic', 'multi:softmax', 'multi:softprob' for classification | 'reg:linear' |
| General | Type of booster | booster | 'gbtree' for tree-based boosters, 'gblinear' for linear-based boosters | 'gbtree' |
| General | Number of iterations | nrounds | Number of boosting rounds, typically set to 50 | |
| Tree-based | Depth of trees | max.depth | Maximum depth of trees | 6 |
| Tree-based | Step size | eta | Step size of each boosting step | 0.3 |
| Linear | L1 regularization | alpha | LASSO coefficient | 0 |
| Linear | L2 regularization | lambda | Ridge coefficient | 0 |
| Linear | L2 regularization (bias) | lambda_bias | Ridge coefficient applied to the bias term | 0 |
| Performance | Evaluation set | watchlist | List of DMatrix used for metrics generation | |
| Performance | Metric | eval_metric | 'rmse', 'mae' for regression; 'error', 'logloss', 'mlogloss', 'auc' for classification | |

Predictions are made as follows:

predictions = predict(model, newdata, params)
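
End to end, the sketch below trains a binary classifier on the built-in mtcars dataset, predicting am from the remaining columns; the dataset and parameter values are assumptions made for the example:

library(xgboost)

# Build the DMatrix from a numeric feature matrix and a 0/1 label vector
x = as.matrix(mtcars[, setdiff(names(mtcars), 'am')])
y = mtcars$am
data_xgb = xgb.DMatrix(data = x, label = y)

# Train a tree-based booster for binary classification
model = xgb.train(params = list(objective = 'binary:logistic',
                                booster = 'gbtree',
                                max_depth = 6, eta = 0.3),
                  data = data_xgb, nrounds = 50)

# Predicted probabilities on the training features
probs = predict(model, newdata = x)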


Model performance

Cross-validation Cross-validation is a technique used to find the optimal hyperparameters of a model while making sure that it does not overfit the training set. It divides the dataset into $k$ splits and, for each split, trains the model on the other $k-1$ splits while evaluating predictions on the held-out one.



This can be achieved with the caret library, which will fit the resulting model on the training set with the cross-validated optimal hyperparameters.

First, the training setting is defined as follows:

# Set partitions
train_control = trainControl(params)

where the parameters are summarized in the table below:

| Parameter | Command | Description |
|---|---|---|
| Sampling | method | Examples include 'boot', 'cv', 'repeatedcv' |
| Folds | number | Number of folds |
| Repeats | repeats | Number of times to perform cross-validation |

Then, a model can be fit using the control parameters as follows:

# Find optimal parameters with cross-validation
model = train(formula, data, trControl = train_control, method, metric, maximize, tuneGrid)

The parameters are summed up in the table below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Type of model | method | Examples include 'glmnet', 'rf', 'svmLinearWeights', 'rpart' | 'rf' |
| Grid search | tuneGrid | expand.grid(parameters = range) | NULL |
| Training control | trControl | trainControl(params) | trainControl() |
| Metric | metric | 'RMSE', 'R2', 'MAE' for regression; 'Accuracy', 'ROC', 'F1' for classification | 'RMSE' (regression), 'Accuracy' (classification) |
| Maximization | maximize | Boolean indicating whether to maximize the chosen metric | FALSE for error-type metrics such as 'RMSE' and 'MAE', TRUE otherwise |
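
Putting both steps together, the sketch below tunes a random forest on the built-in iris dataset with repeated 5-fold cross-validation; the dataset, fold counts and mtry grid are assumptions made for the example:

library(caret)

# Set partitions: 5-fold cross-validation repeated 3 times
train_control = trainControl(method = 'repeatedcv', number = 5, repeats = 3)

# Find the optimal mtry by cross-validation, then refit on the full training set
model = train(Species ~ ., data = iris,
              method = 'rf',
              trControl = train_control,
              metric = 'Accuracy',
              tuneGrid = expand.grid(mtry = 1:3))

model$bestTune                   # cross-validated optimal hyperparameters
predict(model, newdata = iris)   # predictions from the refitted model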


Regression metrics The table below summarizes the main metrics used in regression problems, based on predictions pred and actual values act, using the caret library:

| Metric | Command | Definition | Interpretation |
|---|---|---|---|
| RMSE | RMSE(pred, act) | $\displaystyle\sqrt{\textrm{MSE}}=\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\widehat{y_i})^2}$ | Root mean square error |
| MAE | MAE(pred, act) | $\displaystyle\frac{1}{n}\sum_{i=1}^n\vert y_i-\widehat{y_i}\vert$ | Mean absolute error |
| $R^2$ | R2(pred, act, form = 'traditional') | $\displaystyle1-\frac{\textrm{SSE}}{\textrm{SST}}$ with $\displaystyle\textrm{SST}=\sum_{i=1}^n(y_i-\overline{y})^2$ and $\displaystyle\textrm{SSE}=\sum_{i=1}^n(y_i-\widehat{y_i})^2$ | Proportion of the variance that is explained by the model |
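
For instance, these metrics can be computed on the in-sample predictions of a linear model; the mtcars dataset and the fitted model are assumptions made for the example, and any pair of numeric vectors pred and act works:

library(caret)

# Example: compare in-sample predictions of a linear model with the actual values
act  = mtcars$mpg
pred = predict(lm(mpg ~ wt + hp, data = mtcars))

RMSE(pred, act)                       # root mean square error
MAE(pred, act)                        # mean absolute error
R2(pred, act, form = 'traditional')   # proportion of variance explained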


Classification metrics The table below summarizes the main metrics used in classification problems, which can be accessed with the caret, ModelMetrics and PRROC libraries:

| Metric | Command | Definition | Interpretation |
|---|---|---|---|
| Confusion matrix | confusionMatrix(act, pred) | Matrix of $\textrm{TP}$, $\textrm{FP}$, $\textrm{TN}$, $\textrm{FN}$ | Granular view of model performance |
| Accuracy | mean(act == pred) | $\displaystyle\frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}}$ | How accurate predictions are |
| Precision | precision(act, pred) | $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}$ | How accurate positive predictions are |
| Recall / Sensitivity / $\textrm{TPR}$ | recall(act, pred) | $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}$ | Coverage of actual positive samples |
| Specificity / $\textrm{TNR}$ | specificity(act, pred) | $\displaystyle\frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}}$ | Coverage of actual negative samples |
| F1 score | f1Score(act, pred) | $\displaystyle\frac{2\textrm{TP}}{2\textrm{TP}+\textrm{FP}+\textrm{FN}}$ | Harmonic mean of precision and recall |
| Matthews correlation coefficient | mcc(act, pred, cutoff) | $\frac{\textrm{TP}\times\textrm{TN}-\textrm{FP}\times\textrm{FN}}{\sqrt{(\textrm{TP}+\textrm{FP})(\textrm{TP}+\textrm{FN})(\textrm{TN}+\textrm{FP})(\textrm{TN}+\textrm{FN})}}$ | Hybrid metric between -1 and 1, useful for unbalanced classes |
| Receiver operating characteristic (ROC) curve | obj = roc.curve(pred, act); plot(obj) | Plot of $\textrm{TPR}$ with respect to $\textrm{FPR}$ for different cutoff values | The identity line corresponds to a random-guess classifier |
| AUC | auc(act, pred) | Area under the curve that plots $\textrm{TPR}$ with respect to $\textrm{FPR}$ | Value between 0 and 1, where 0.5 is equivalent to a classifier making random predictions |
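
As a small illustration, the sketch below computes a few of these metrics on toy vectors of actual labels and predicted probabilities; the values, the 0.5 cutoff and the positive/negative split passed to roc.curve are assumptions made for the example:

library(ModelMetrics)
library(PRROC)

# Toy binary example: actual labels, predicted probabilities, predicted classes
act   = c(0, 0, 1, 1, 1)
probs = c(0.2, 0.6, 0.4, 0.7, 0.9)
pred  = ifelse(probs > 0.5, 1, 0)

mean(act == pred)                # accuracy
precision(act, pred)             # TP / (TP + FP)
recall(act, pred)                # TP / (TP + FN)
auc(act, probs)                  # area under the ROC curve

# ROC curve: PRROC expects scores of the positive class first, then the negative class
obj = roc.curve(scores.class0 = probs[act == 1],
                scores.class1 = probs[act == 0],
                curve = TRUE)
plot(obj)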