Unsupervised learning with R study guide

Overview

Several steps are needed to build a machine learning model.


Unsupervised models

In this part, we do not have true labels, and we want to get insights from the data.

k-means

The $k$-means algorithm can be performed using the stats package, which ships with base R.

Given a matrix of features x and additional parameters, the model is fitted as follows:

model = kmeans(x, params)

where the parameters are summarized below:

Parameter       Command    Description                                         Default
Clusters        centers    Number of clusters                                  (none)
Iterations      iter.max   Maximum number of iterations until final clusters   10
Initialization  nstart     Number of initial configurations that are tried     1

The components of the fitted model can then be accessed with the commands summarized below (for example, model$cluster):

Component  Command   Description
Clusters   cluster   Cluster index assigned to each observation
Centers    centers   Center of each cluster
Size       size      Number of observations in each cluster
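
As a minimal sketch of the call and the components above, the following fits a model and inspects it. The built-in iris dataset and the choice of 3 clusters are assumptions made for illustration only; the guide does not prescribe a dataset.

```r
# Illustrative input (an assumption): the four numeric columns of iris
x <- as.matrix(iris[, 1:4])

set.seed(42)  # fix the random initialization
model <- kmeans(x, centers = 3, iter.max = 10, nstart = 25)

model$cluster[1:5]  # cluster index assigned to the first five observations
model$centers       # 3 x 4 matrix, one row per cluster center
model$size          # number of observations in each cluster
```

Setting nstart above 1 makes kmeans try several random initializations and keep the best one, which usually gives more stable clusters than the default of 1.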

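The elbow method, together with the distortion metric (the total within-cluster sum of squares, available as tot.withinss), is used to select the number of clusters: distortion is plotted against the number of clusters, and the value at which the curve bends and additional clusters bring only marginal improvement is chosen.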
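
One possible way to draw the elbow curve is sketched below. It assumes the iris dataset as input and an arbitrary range of 1 to 8 clusters; both are illustrative choices, not part of the guide.

```r
x <- as.matrix(iris[, 1:4])  # illustrative input (an assumption)
set.seed(42)

# Distortion (total within-cluster sum of squares) for each candidate k
k_values <- 1:8
distortion <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the point where the curve stops dropping sharply
plot(k_values, distortion, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```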

Remark: because the initial cluster centers are chosen at random, it is important for reproducibility to set the seed to a fixed value with the set.seed() command.
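
The effect of the seed can be checked with a short sketch like the one below, which assumes the iris dataset and an arbitrary seed value of 123. Resetting the same seed before each call makes both runs start from the same random initialization.

```r
x <- as.matrix(iris[, 1:4])  # illustrative input (an assumption)

set.seed(123)
first_run <- kmeans(x, centers = 3)

set.seed(123)  # same seed, so the same initial centers are drawn
second_run <- kmeans(x, centers = 3)

# Compare the cluster assignments of the two runs
identical(first_run$cluster, second_run$cluster)
```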


Hierarchical clustering

The hierarchical clustering algorithm is an unsupervised technique that forms clusters of observations that look alike based on their pairwise distances. It can be performed using the base stats library:

# Dissimilarity matrix
df_distance = dist(data, method = "euclidean")

# Hierarchical clustering using complete linkage on the dissimilarity matrix
hc = hclust(df_distance, method = "complete")
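
The two steps above can be run end to end and followed by a cut of the tree into flat clusters, as in the sketch below. The iris dataset, the standardization with scale(), and the choice of 3 groups are assumptions made for illustration.

```r
# Illustrative input (an assumption): standardized numeric columns of iris
data <- scale(iris[, 1:4])

df_distance <- dist(data, method = "euclidean")
hc <- hclust(df_distance, method = "complete")

plot(hc, labels = FALSE)     # dendrogram of the merge history
groups <- cutree(hc, k = 3)  # cut the tree into 3 clusters (arbitrary choice)
table(groups)                # number of observations per cluster
```

Standardizing first keeps variables measured on larger scales from dominating the Euclidean distances.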


PCA

Principal components analysis, commonly known as PCA, is a dimension reduction technique that can be used as a data preparation step to reduce the number of variables.

It is done using the prcomp function on a matrix x as follows:

model = prcomp(x, params)

where the parameters are summarized below:

Parameter  Command  Description                                                             Default
Centering  center   Boolean that indicates whether input should be centered around zero     TRUE
Scaling    scale.   Boolean that indicates whether input should be scaled to unit variance  FALSE
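
A minimal sketch of the call with both parameters set explicitly is shown below, assuming the iris dataset as illustrative input. The share of variance carried by each component is computed from the sdev component of the fitted model.

```r
x <- as.matrix(iris[, 1:4])  # illustrative input (an assumption)

model <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- model$sdev^2 / sum(model$sdev^2)
round(var_explained, 3)

head(model$x[, 1:2])  # observations projected onto the first two components
```

Scaling is switched on here because the variables are measured on different ranges; with the default scale. = FALSE, the variables with the largest variance would dominate the first components.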

A scree plot is commonly used to check how much of the variance is explained by the newly-created principal components. The table below summarizes the main actions that can be done using the factoextra library: