Modeling


Machine Learning with R

R caret rpart randomForest class e1071 stats factoextra

By Afshine Amidi and Shervine Amidi

Overview

There are several steps that are needed to build a machine learning model:


Unsupervised models

In this part, we do not have true labels, and we want to get insights from the data.

k-means: The $k$-means algorithm can be run with the kmeans function from the base stats library.

By taking as input a matrix of features x as well as additional parameters, we have:

model = kmeans(x, params)
where the parameters are summarized below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Clusters | centers | Number of clusters | |
| Iterations | iter.max | Maximum number of iterations until final clusters | 10 |
| Initialization | nstart | Number of initial configurations that are tried | 1 |

The attributes of the model can then be accessed with the dimensions summarized below:

| Dimension | Command | Description |
|---|---|---|
| Clusters | cluster | Cluster assigned to each observation |
| Center | centers | Center of each cluster |
| Size | size | Size of each cluster |
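As an illustration of the call and the attributes above, here is a minimal sketch. The built-in iris data set, the choice of 3 clusters, and the values of nstart and the seed are assumptions made for the example, not recommendations from the original text.

```r
# Toy data (assumption): the four numeric columns of the built-in iris data set
x = as.matrix(iris[, 1:4])

set.seed(42)  # fixed seed so that the random initialization is repeatable
model = kmeans(x, centers = 3, iter.max = 10, nstart = 5)

model$cluster  # integer vector: cluster assigned to each row of x
model$centers  # matrix with one row per cluster center
model$size     # number of observations in each cluster
```

The sizes in model$size always add up to the number of rows of x, which is a quick sanity check on the fit.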

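The elbow method is used together with the distortion metric to select the number of clusters: the distortion is plotted against the number of clusters, and the value at which the curve bends (the "elbow") is chosen.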

Remark: for reproducibility purposes, it is important to set the seed to a fixed value with the set.seed() command.
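The elbow method can be sketched as follows. This example assumes that the distortion is measured by the total within-cluster sum of squares (the tot.withinss attribute returned by kmeans); the iris data set, the range of k, and the seed are arbitrary choices for illustration.

```r
# Toy data (assumption): the four numeric columns of the built-in iris data set
x = as.matrix(iris[, 1:4])

set.seed(42)    # fixed seed for reproducible cluster initializations
k_values = 1:8  # candidate numbers of clusters (arbitrary range)

# Distortion for each k, taken here as the total within-cluster sum of squares
distortion = sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out
plot(k_values, distortion, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Using nstart greater than 1 makes each point on the curve less dependent on a single random initialization.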


Hierarchical clustering: The hierarchical clustering algorithm is an unsupervised technique that groups observations into clusters based on the distances between them. It can be performed using the base stats library:

# Dissimilarity matrix
df_distance = dist(data, method = "euclidean")

# Hierarchical clustering using Complete Linkage
hc = hclust(df_distance, method = "complete")
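A self-contained sketch of the same workflow is given below, extended with two common follow-up steps from the stats library: plotting the dendrogram and cutting the tree into a fixed number of groups with cutree. The iris data set, the scaling step, and the choice of 3 groups are assumptions made for illustration.

```r
# Toy data (assumption): standardized numeric columns of the built-in iris data set
data = scale(iris[, 1:4])

# Dissimilarity matrix
df_distance = dist(data, method = "euclidean")

# Hierarchical clustering using complete linkage
hc = hclust(df_distance, method = "complete")

# Dendrogram of the merges (labels hidden for readability)
plot(hc, labels = FALSE)

# Cut the tree into 3 groups (arbitrary choice) and count observations per group
groups = cutree(hc, k = 3)
table(groups)
```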

PCA: Principal components analysis, commonly known as PCA, is a dimension reduction technique that can be used as a data preparation step to reduce the number of variables.

It is done using the prcomp function on a matrix x as follows:

model = prcomp(x, params)

where the parameters are summarized below:

| Parameter | Command | Description | Default |
|---|---|---|---|
| Centering | center | Boolean that indicates whether input should be centered around zero | TRUE |
| Scaling | scale. | Boolean that indicates whether input should be scaled to unit variance | FALSE |
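As a minimal sketch of the call above, the example below runs prcomp on the numeric columns of the built-in iris data set (an assumption for illustration) and derives the share of variance carried by each component from the sdev attribute of the fitted model.

```r
# Toy data (assumption): the four numeric columns of the built-in iris data set
x = as.matrix(iris[, 1:4])

# Center and scale the inputs before extracting the principal components
model = prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each component
explained = model$sdev^2 / sum(model$sdev^2)
round(explained, 3)

# Coordinates of the first observations in the principal component space
head(model$x)
```

Scaling is switched on here because the iris columns have different spreads; with the default scale. = FALSE, variables with larger variance dominate the first components.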

A scree plot is commonly used to check how much of the variance is explained by the newly-created principal components. The table below summarizes the main actions that can be done using the factoextra library: