Modeling
Machine Learning with R
R
caret
rpart
randomForest
class
e1071
stats
factoextra
By Afshine Amidi and Shervine Amidi
Overview
There are several steps that are needed to build a machine learning model:
- feature engineering: building features that can be interpreted and that can have a high predictive power
- model selection: choosing a model that can generalize well on unseen data
- tracking performance metrics: quantifying how the model performs and including the uncertainty of predictions in the results
Unsupervised models
In this part, we do not have true labels, and we want to get insights from the data.
k-means The $k$-means algorithm can be performed using the base stats library.

By taking as input a matrix of features x as well as additional parameters, we have:
model = kmeans(x, params)
| Parameter | Command | Description | Default |
|---|---|---|---|
| Clusters | centers | Number of clusters | |
| Iterations | iter.max | Maximum number of iterations until final clusters | 10 |
| Initialization | nstart | Number of initial configurations that are tried | 1 |
The attributes of the model can then be accessed with the dimensions summarized below:
| Dimension | Command | Description |
|---|---|---|
| Clusters | cluster | Cluster assignment of each observation |
| Centers | centers | Center of each cluster |
| Size | size | Size of each cluster |
The elbow method, together with the distortion metric (the total within-cluster sum of squares), is commonly used to select the number of clusters that makes the most sense.
Remark: for reproducibility purposes, it is important to set the seed to a fixed value with the set.seed() command.
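A minimal sketch of the steps above, assuming the built-in iris dataset as illustrative input (the stats library is loaded by default in R):

```r
# Fix the seed so the random initializations are reproducible
set.seed(42)

# Keep the numeric features only
x <- iris[, 1:4]

# Fit k-means with 3 clusters and 25 random starts
model <- kmeans(x, centers = 3, iter.max = 10, nstart = 25)

model$cluster    # cluster assignment of each observation
model$centers    # coordinates of each cluster center
model$size       # number of observations in each cluster

# Elbow method: distortion (total within-cluster sum of squares) for k = 1..10
distortions <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:10, distortions, type = "b",
     xlab = "Number of clusters k", ylab = "Distortion")
```

The "elbow" is the value of k after which the distortion stops decreasing sharply.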
Hierarchical clustering The hierarchical clustering algorithm is an unsupervised technique that forms clusters of observations that look alike based on their distances. It can be performed using the base stats library:
# Dissimilarity matrix
df_distance = dist(data, method = "euclidean")
# Hierarchical clustering using complete linkage
hc = hclust(df_distance, method = "complete")
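Putting these steps together, a minimal sketch assuming the built-in USArrests dataset (the dataset and the choice of 4 clusters are illustrative):

```r
# Standardize the features so no single variable dominates the distances
df <- scale(USArrests)

# Dissimilarity matrix of Euclidean distances
df_distance <- dist(df, method = "euclidean")

# Hierarchical clustering with complete linkage
hc <- hclust(df_distance, method = "complete")

# Dendrogram of the successive merges
plot(hc)

# Cut the tree into 4 clusters and look at their sizes
clusters <- cutree(hc, k = 4)
table(clusters)
```

Unlike $k$-means, the number of clusters does not have to be chosen before fitting: the same tree can be cut at different heights with cutree().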
PCA Principal components analysis, commonly known as PCA, is a dimension reduction technique that can be used as a data preparation step to reduce the number of variables.

It is done using the prcomp function on a matrix x as follows:
model = prcomp(x, params)
where the parameters are summarized below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Centering | center | Boolean that indicates whether the input should be centered around zero | TRUE |
| Scaling | scale. | Boolean that indicates whether the input should be scaled to unit variance | FALSE |
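A minimal sketch, assuming the built-in mtcars dataset as illustrative input (note that prcomp's formal argument for scaling is spelled scale., with a trailing dot):

```r
# PCA on centered and scaled features
model <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
summary(model)

# Coordinates of the observations on the first two components
head(model$x[, 1:2])
```

Scaling matters whenever the variables are measured in different units, since PCA is sensitive to the variance of each input.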
A scree plot is commonly used to check how much of the variance is explained by the newly-created principal components. The table below summarizes the main actions that can be done using the factoextra
library: