Supervised learning with R study guide

Overview

There are several steps needed to build a machine learning model: preparing the data, training the model, and evaluating its performance.


Supervised models

In this part, we want to learn the relationship between the data and its labels. The task can be either regression (continuous output) or classification (categorical output).

General methods

Linear regression Linear regression is a supervised learning method used for regression problems.


Given a data frame data containing the independent variables x and the dependent variable y, along with a formula of the type y ~ x, we can build a linear regression model as follows:

model = lm(formula, data)

An overview of the model can be obtained with summary(model), and the model object can make predictions on a set of data newdata with the following command:

predictions = predict(model, newdata, params)

where the parameters are summarized in the table below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Standard errors | se.fit | Boolean indicating whether standard errors are displayed | FALSE |
| Confidence interval | interval | Type of interval. Possible values are 'none', 'confidence', 'prediction' | 'none' |
| Confidence level | level | Numeric value between 0 and 1 | 0.95 |

Remark: check whether predictors are multicollinear before interpreting coefficients. Linear regression can be used to interpret the association between predictors and the output.
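
As an illustration, the sketch below fits a linear model on the built-in mtcars dataset; the dataset and the choice of variables are assumptions made for the example, and any data frame with a formula works the same way:

# Fit mpg as a linear function of weight and horsepower (example data: mtcars)
model = lm(mpg ~ wt + hp, data = mtcars)
summary(model)   # coefficients, standard errors, R-squared

# Predict on new observations with a 95% prediction interval
newdata = data.frame(wt = c(2.5, 3.0), hp = c(110, 150))
predictions = predict(model, newdata, interval = 'prediction', level = 0.95)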


Logistic regression Logistic regression is a supervised learning method for classification problems.


Given a data frame data containing the independent variables x and the dependent variable y, along with a formula of the type y ~ x, we can build a logistic regression model as follows:

model = glm(formula, data, family = 'binomial')
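
For illustration, the sketch below fits a logistic regression on the built-in mtcars dataset, predicting the binary variable am; the dataset and the 0.5 threshold are assumptions made for the example:

# Predict the binary variable am (transmission type) from weight and horsepower
model = glm(am ~ wt + hp, data = mtcars, family = 'binomial')
summary(model)

# type = 'response' returns probabilities, which can then be thresholded
probs = predict(model, newdata = mtcars, type = 'response')
preds = ifelse(probs > 0.5, 1, 0)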


k-NN The k-nearest neighbors algorithm, also known as k-NN, is a supervised learning method used here for classification.


Given training features train along with labels cl as well as test features test, we can compute predictions using the class library with the following command:

predictions = knn(train, test, cl, params)

where the hyperparameters customizable by params are presented below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Number of neighbors | k | Typically set to 3 | 1 |
| Probabilistic | prob | Boolean indicating whether probabilistic predictions are displayed | FALSE |
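
As an example, the sketch below applies k-NN to the built-in iris dataset with k = 3; the dataset, the train/test split and the value of k are assumptions made for the example:

library(class)

# Example: split the built-in iris dataset into training and test features
set.seed(42)
idx   = sample(nrow(iris), 100)
train = iris[idx, 1:4]            # training features
test  = iris[-idx, 1:4]           # test features
cl    = iris$Species[idx]         # training labels

predictions = knn(train, test, cl, k = 3, prob = TRUE)
mean(predictions == iris$Species[-idx])   # test accuracy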


SVM Support vector machines (SVM) are algorithms that can be used both for regression (SVR) and classification (SVC).


The following command creates a model using the e1071 library:

model = svm(formula, data, params)

where the parameters are summarized below:

| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | type | 'eps-regression' for regression, 'C-classification' for classification | |
| General | Probability | probability | Boolean that allows probability predictions (classification) | FALSE |
| General | Kernel | kernel | 'linear', 'polynomial', 'radial', 'sigmoid' | 'radial' |
| General | Cost | cost | Penalization of constraints violation | 1 |
| Kernel specific | Degree | degree | Required for a polynomial kernel | 3 |
| Kernel specific | $\gamma$ | gamma | Required for polynomial, radial and sigmoid kernels | 1/ncol(x) |
| Kernel specific | Intercept | coef0 | Required for polynomial and sigmoid kernels | 0 |

We can make predictions with:

predictions = predict(model, newdata, params)

where the parameters are summarized in the table below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Decision values | decision.values | Boolean enabling intermediary computations to be returned in the multi-class classification setting | FALSE |
| Probabilistic predictions | probability | Boolean enabling probability predictions to be returned | FALSE |
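
Putting both steps together, the sketch below trains an SVM classifier on the built-in iris dataset and retrieves class probabilities; the dataset and hyperparameter values are assumptions made for the example:

library(e1071)

# Train an SVM classifier with a radial kernel on the built-in iris dataset
model = svm(Species ~ ., data = iris,
            type = 'C-classification', kernel = 'radial',
            cost = 1, probability = TRUE)

# Class predictions and class probabilities on new data
predictions = predict(model, newdata = iris, probability = TRUE)
head(attr(predictions, 'probabilities'))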


Decision tree Decision trees are a type of tree-based model that can be built for either classification or regression purposes.

The following command builds a model using the rpart library:

model = rpart(formula, data, params)

The parameters are summarized in the table below:

| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | method | 'anova' for regression, 'class' for classification | |
| Parameters | Sample split | minsplit | Minimum number of observations in a node before a split is attempted | 20 |
| Parameters | Sample bucket | minbucket | Minimum number of observations in a terminal leaf | minsplit/3 |
| Parameters | Complexity parameter | cp | Amount of relative error improvement needed to allow an additional split, with small values leading to overfitting | 0.01 |
| Parameters | Maximum depth | maxdepth | Maximum depth allowed in the final tree | 30 |

The resulting tree can be interpreted using commands from the rpart.plot library, which are summarized in the table below:

| Action | Command |
|---|---|
| Draw tree structure | rpart.plot(model) |
| Retrieve tree rules | rpart.rules(model) |
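
For example, the sketch below grows a classification tree on the built-in iris dataset and visualizes it; the dataset and parameter values are assumptions made for the example:

library(rpart)
library(rpart.plot)

# Grow a classification tree on the built-in iris dataset
model = rpart(Species ~ ., data = iris, method = 'class',
              minsplit = 20, cp = 0.01, maxdepth = 30)

rpart.plot(model)    # draw the tree structure
rpart.rules(model)   # print the decision rules

predictions = predict(model, newdata = iris, type = 'class')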


Ensemble models

Random forest Random forest is a tree-based ensemble model.

The following command uses the randomForest library:

model = randomForest(formula, data, params)

where the parameters are summarized below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Number of trees | ntree | Number of independent trees in the random forest | 500 |
| Minimum leaf size | nodesize | Minimum number of observations in a tree leaf, where the higher the value, the smaller the trees | 5 (regression), 1 (classification) |
| Number of sampled predictors | mtry | Number of predictors randomly sampled as split candidates | ncol(x)/3 (regression), sqrt(ncol(x)) (classification) |
| Sampling replacement | replace | Boolean value indicating whether sampling of cases is done with replacement | TRUE |
| Class weight | classwt | Priors of the classes; only used in classification | NULL |


We can make predictions as follows:

predictions = predict(model, newdata, params)

where the parameters are summarized in the table below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Output type | type | Output can be predicted values ('response'), matrix of class probabilities ('prob') or matrix of vote counts ('votes') | 'response' |
| Predictions | predict.all | Boolean indicating whether all tree predictions are kept in memory | FALSE |
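
As an illustration, the sketch below trains a random forest classifier on the built-in iris dataset; the dataset and hyperparameter values are assumptions made for the example:

library(randomForest)

# Train a random forest classifier on the built-in iris dataset
set.seed(42)
model = randomForest(Species ~ ., data = iris,
                     ntree = 500, mtry = 2, nodesize = 1)

# Predicted classes and class probabilities
predictions = predict(model, newdata = iris, type = 'response')
probs       = predict(model, newdata = iris, type = 'prob')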


XGBoost Gradient boosting models can be trained using the xgboost library.

First, the data needs to be converted to the DMatrix format, which can be done with the following command:

data_xgb = xgb.DMatrix(data, label)

The model can then be trained as follows:

model = xgb.train(params, data_xgb, nrounds)

where the parameters are summarized below:

| Category | Parameter | Command | Description | Default |
|---|---|---|---|---|
| General | Type of model | objective | 'reg:linear' for regression; 'binary:logistic', 'multi:softmax', 'multi:softprob' for classification | 'reg:linear' |
| General | Type of booster | booster | 'gbtree' for tree-based boosters, 'gblinear' for linear-based boosters | 'gbtree' |
| General | Number of iterations | nrounds | Number of boosting rounds, typically set to 50 | |
| Tree-based | Depth of trees | max.depth | Maximum depth of trees | 6 |
| Tree-based | Step size | eta | Step size of each boosting step | 0.3 |
| Linear | L1 regularization | alpha | LASSO coefficient | 0 |
| Linear | L2 regularization | lambda | Ridge coefficient | 0 |
| Linear | L2 regularization (bias) | lambda_bias | Ridge coefficient applied to the bias term | 0 |
| Performance | Evaluation set | watchlist | List of DMatrix used for metrics generation | |
| Performance | Metric | eval_metric | 'rmse', 'mae' for regression; 'error', 'logloss', 'mlogloss', 'auc' for classification | |

Predictions are made as follows:

predictions = predict(model, newdata, params)
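
End to end, the sketch below trains a binary classifier on the built-in mtcars dataset, predicting am from the remaining columns; the dataset and parameter values are assumptions made for the example:

library(xgboost)

# Build the DMatrix from a numeric feature matrix and a 0/1 label vector
x = as.matrix(mtcars[, setdiff(names(mtcars), 'am')])
y = mtcars$am
data_xgb = xgb.DMatrix(data = x, label = y)

# Train a tree-based booster for binary classification
model = xgb.train(params = list(objective = 'binary:logistic',
                                booster = 'gbtree',
                                max_depth = 6, eta = 0.3),
                  data = data_xgb, nrounds = 50)

# Predicted probabilities on the training features
probs = predict(model, newdata = x)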


Model performance

Cross-validation Cross-validation is a technique used to find the optimal hyperparameters of a model while making sure that it does not overfit the training set. It divides the dataset into $k$ splits and, for each split, trains the model on the other $k-1$ splits while evaluating predictions on the held-out one.



This can be achieved with the caret library, which will fit the resulting model on the training set with the cross-validated optimal hyperparameters.

First, the training setting is defined as follows:

# Set partitions
train_control = trainControl(params)

where the parameters are summarized in the table below:

| Parameter | Command | Description |
|---|---|---|
| Sampling | method | Examples include 'boot', 'cv', 'repeatedcv' |
| Folds | number | Number of folds |
| Repeats | repeats | Number of times to perform cross-validation |

Then, a model can be fit using the control parameters as follows:

# Find optimal parameters with cross-validation
model = train(formula, data, trControl = train_control, method, metric, maximize, tuneGrid)

The parameters are summed up in the table below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Type of model | method | Examples include 'glmnet', 'rf', 'svmLinearWeights', 'rpart' | 'rf' |
| Grid search | tuneGrid | expand.grid(parameters = range) | NULL |
| Training control | trControl | trainControl(params) | trainControl() |
| Metric | metric | 'RMSE', 'R2', 'MAE' for regression; 'Accuracy', 'ROC', 'F1' for classification | 'RMSE' (regression), 'Accuracy' (classification) |
| Maximization | maximize | Boolean indicating whether to maximize the chosen metric | FALSE for error-type metrics such as 'RMSE' and 'MAE', TRUE otherwise |
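
Putting both steps together, the sketch below tunes a random forest on the built-in iris dataset with repeated 5-fold cross-validation; the dataset, fold counts and mtry grid are assumptions made for the example:

library(caret)

# Set partitions: 5-fold cross-validation repeated 3 times
train_control = trainControl(method = 'repeatedcv', number = 5, repeats = 3)

# Find the optimal mtry by cross-validation, then refit on the full training set
model = train(Species ~ ., data = iris,
              method = 'rf',
              trControl = train_control,
              metric = 'Accuracy',
              tuneGrid = expand.grid(mtry = 1:3))

model$bestTune                   # cross-validated optimal hyperparameters
predict(model, newdata = iris)   # predictions from the refitted model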


Regression metrics The table below summarizes the main metrics used in regression problems, based on predictions pred and actual values act, using the caret library:

| Metric | Command | Definition | Interpretation |
|---|---|---|---|
| RMSE | RMSE(pred, act) | $\displaystyle\sqrt{\textrm{MSE}}=\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\widehat{y_i})^2}$ | Root mean square error |
| MAE | MAE(pred, act) | $\displaystyle\frac{1}{n}\sum_{i=1}^n\vert y_i-\widehat{y_i}\vert$ | Mean absolute error |
| $R^2$ | R2(pred, act, form = 'traditional') | $\displaystyle1-\frac{\textrm{SSE}}{\textrm{SST}}$ with $\displaystyle\textrm{SST}=\sum_{i=1}^n(y_i-\overline{y})^2$ and $\displaystyle\textrm{SSE}=\sum_{i=1}^n(y_i-\widehat{y_i})^2$ | Proportion of the variance that is explained by the model |
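
For instance, these metrics can be computed on the in-sample predictions of a linear model; the mtcars dataset and the fitted model are assumptions made for the example, and any pair of numeric vectors pred and act works:

library(caret)

# Example: compare in-sample predictions of a linear model with the actual values
act  = mtcars$mpg
pred = predict(lm(mpg ~ wt + hp, data = mtcars))

RMSE(pred, act)                       # root mean square error
MAE(pred, act)                        # mean absolute error
R2(pred, act, form = 'traditional')   # proportion of variance explained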


Classification metrics The table below summarizes the main metrics used in classification problems, which can be accessed with the caret, ModelMetrics and PRROC libraries:

| Metric | Command | Definition | Interpretation |
|---|---|---|---|
| Confusion matrix | confusionMatrix(act, pred) | Matrix of $\textrm{TP}$, $\textrm{FP}$, $\textrm{TN}$, $\textrm{FN}$ | Granular view of model performance |
| Accuracy | mean(act == pred) | $\displaystyle\frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}}$ | How accurate predictions are |
| Precision | precision(act, pred) | $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}$ | How accurate positive predictions are |
| Recall / Sensitivity / $\textrm{TPR}$ | recall(act, pred) | $\displaystyle\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}$ | Coverage of actual positive samples |
| Specificity / $\textrm{TNR}$ | specificity(act, pred) | $\displaystyle\frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}}$ | Coverage of actual negative samples |
| F1 score | f1Score(act, pred) | $\displaystyle\frac{2\textrm{TP}}{2\textrm{TP}+\textrm{FP}+\textrm{FN}}$ | Harmonic mean of precision and recall |
| Matthews correlation coefficient | mcc(act, pred, cutoff) | $\frac{\textrm{TP}\times\textrm{TN}-\textrm{FP}\times\textrm{FN}}{\sqrt{(\textrm{TP}+\textrm{FP})(\textrm{TP}+\textrm{FN})(\textrm{TN}+\textrm{FP})(\textrm{TN}+\textrm{FN})}}$ | Hybrid metric between -1 and 1, useful for unbalanced classes |
| Receiver operating characteristic (ROC) curve | obj = roc.curve(pred, act); plot(obj) | Plot of $\textrm{TPR}$ with respect to $\textrm{FPR}$ for different cutoff values | The identity line corresponds to a random-guess classifier |
| AUC | auc(act, pred) | Area under the curve that plots $\textrm{TPR}$ with respect to $\textrm{FPR}$ | Value between 0 and 1, where 0.5 is equivalent to a classifier making random predictions |
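
As a small illustration, the sketch below computes a few of these metrics on toy vectors of actual labels and predicted probabilities; the values, the 0.5 cutoff and the positive/negative split passed to roc.curve are assumptions made for the example:

library(ModelMetrics)
library(PRROC)

# Toy binary example: actual labels, predicted probabilities, predicted classes
act   = c(0, 0, 1, 1, 1)
probs = c(0.2, 0.6, 0.4, 0.7, 0.9)
pred  = ifelse(probs > 0.5, 1, 0)

mean(act == pred)                # accuracy
precision(act, pred)             # TP / (TP + FP)
recall(act, pred)                # TP / (TP + FN)
auc(act, probs)                  # area under the ROC curve

# ROC curve: PRROC expects scores of the positive class first, then the negative class
obj = roc.curve(scores.class0 = probs[act == 1],
                scores.class1 = probs[act == 0],
                curve = TRUE)
plot(obj)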