Modeling

Machine Learning with R

R caret rpart randomForest class e1071 stats factoextra

By Afshine Amidi and Shervine Amidi

Overview

There are several steps that are needed to build a machine learning model:

feature engineering: building features that can be interpreted and that can have a high predictive power
model selection: choosing a model that can generalize well on unseen data
tracking performance metrics: quantifying how the model performs and including the uncertainty of predictions in the results

Unsupervised models

In this part, we do not have true labels, and we want to get insights from the data.

k-means The $k$-means algorithm can be performed using the base stats library.

By taking as input a matrix of features x as well as additional parameters, we have:

model = kmeans(x, params)

where the parameters are summarized below:

Parameter	Command	Description	Default
Clusters	`centers`	Number of clusters
Iterations	`iter.max`	Maximum number of iterations until final clusters	`10`
Initialization	`nstart`	Number of initial configurations that are tried	`1`

The attributes of the model can then be accessed with the dimensions summarized below:

Dimension	Command	Description
Clusters	`cluster`	Index of the cluster assigned to each observation
Center	`centers`	Center of each cluster
Size	`size`	Size of each cluster

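The elbow method, applied to the distortion metric (the total within-cluster sum of squares), is used to select the number of clusters: the value retained is the one beyond which adding clusters only marginally reduces the distortion.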

Remark: for reproducibility purposes, it is important to set the seed to a fixed value with the set.seed() command.
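The steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the input data (the numeric columns of the built-in iris dataset), the seed, the choice of 3 clusters, and the values of iter.max and nstart are all assumptions made for the example.

```r
# Illustrative data (assumption): numeric columns of the built-in iris dataset,
# standardized so that every feature contributes equally to the distances
x <- scale(as.matrix(iris[, 1:4]))

# Fixed seed so that the random initializations are reproducible
set.seed(42)

# k-means with 3 clusters (assumed), up to 50 iterations and 25 random starts
model <- kmeans(x, centers = 3, iter.max = 50, nstart = 25)

model$cluster   # cluster index assigned to each observation
model$centers   # coordinates of each cluster center
model$size      # number of observations in each cluster

# Elbow method: distortion (total within-cluster sum of squares) for k = 1..8
k_values <- 1:8
distortion <- sapply(k_values, function(k) {
  kmeans(x, centers = k, iter.max = 50, nstart = 25)$tot.withinss
})

# The "elbow" is the value of k after which the curve flattens
plot(k_values, distortion, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Using several random starts through nstart tends to make the result less sensitive to a poor initialization, at the cost of extra computation.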

Hierarchical clustering The hierarchical clustering algorithm is an unsupervised technique that groups observations into clusters based on the distances between them. It can be performed using the base stats library:

# Dissimilarity matrix
df_distance = dist(data, method = "euclidean")

# Hierarchical clustering using Complete Linkage
hc = hclust(df_distance, method = "complete")
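Once the tree is built, it is usually inspected as a dendrogram and cut into a chosen number of groups. The sketch below shows one possible end-to-end run using only base functions; the dataset (numeric columns of iris) and the choice of 3 groups are assumptions made for illustration.

```r
# Illustrative data (assumption): standardized numeric columns of iris
data <- scale(iris[, 1:4])

# Dissimilarity matrix based on Euclidean distances
df_distance <- dist(data, method = "euclidean")

# Hierarchical clustering using complete linkage
hc <- hclust(df_distance, method = "complete")

# Dendrogram of the merges (labels hidden for readability)
plot(hc, labels = FALSE, hang = -1, main = "Complete linkage dendrogram")

# Cut the tree into 3 groups (assumed) and count observations per group
groups <- cutree(hc, k = 3)
table(groups)
```

Other values of the method argument, such as "average" or "single", change how the distance between two clusters is computed and can lead to different trees.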

PCA Principal components analysis, commonly known as PCA, is a dimension reduction technique that can be used as a data preparation step to reduce the number of variables.

It is done using the prcomp function on a matrix x as follows:

model = prcomp(x, params)

where the parameters are summarized below:

Parameter	Command	Description	Default
Centering	`center`	Boolean that indicates whether input should be centered around zero	`TRUE`
Scaling	`scale.`	Boolean that indicates whether input should be scaled to unit variance	`FALSE`
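As a minimal sketch of how these parameters are used in practice, the example below runs prcomp on an illustrative matrix and computes the proportion of variance carried by each component. The dataset (numeric columns of iris) and the decision to scale the input are assumptions made for the example.

```r
# Illustrative data (assumption): numeric columns of the built-in iris dataset
x <- as.matrix(iris[, 1:4])

# PCA with centering and scaling to unit variance
model <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
explained <- model$sdev^2 / sum(model$sdev^2)
explained
cumsum(explained)   # cumulative proportion of variance explained

# Coordinates of the first observations on the first two components
head(model$x[, 1:2])

# Loadings: weight of each original variable in each component
model$rotation
```

Scaling is generally worth considering when the variables are measured in different units, since otherwise the variables with the largest variance dominate the first components.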

A scree plot is commonly used to check how much of the variance is explained by the newly-created principal components. The table below summarizes the main actions that can be done using the factoextra library: