Modeling
Machine Learning with R
R caret rpart randomForest class e1071 stats factoextra
By Afshine Amidi and Shervine Amidi
Overview
Several steps are needed to build a machine learning model:
- feature engineering: building features that are interpretable and have high predictive power
- model selection: choosing a model that can generalize well on unseen data
- tracking performance metrics: quantifying how the model performs and including the uncertainty of predictions in the results
Unsupervised models
In this part, we do not have true labels, and we want to get insights from the data.
k-means The $k$-means algorithm can be performed using the base stats library.
It takes as input a matrix of features x as well as additional parameters:
model = kmeans(x, params)
| Parameter | Command | Description | Default |
|---|---|---|---|
| Clusters | centers | Number of clusters | |
| Iterations | iter.max | Maximum number of iterations until final clusters | 10 |
| Initialization | nstart | Number of initial configurations that are tried | 1 |
The attributes of the model can then be accessed with the dimensions summarized below:
| Dimension | Command | Description |
|---|---|---|
| Clusters | cluster | Cluster to which each observation belongs |
| Center | centers | Center of each cluster |
| Size | size | Size of each cluster |
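As an illustration, here is a minimal sketch fitting $k$-means on a matrix of numeric features x; the choice of 3 clusters and 20 initial configurations are assumptions made for the example:
# Fix the seed for reproducibility
set.seed(42)
# Fit k-means with 3 clusters and 20 random initial configurations
model = kmeans(x, centers = 3, iter.max = 10, nstart = 20)
# Access the fitted attributes
model$cluster    # cluster to which each observation belongs
model$centers    # center of each cluster
model$size       # size of each cluster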
The elbow method, along with the distortion metric (e.g. the total within-cluster sum of squares), is used to select the number of clusters that makes the most sense.
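A minimal sketch of the elbow method, assuming the distortion is taken as the total within-cluster sum of squares (tot.withinss) returned by kmeans and that up to 10 clusters are tried:
# Compute the distortion for k = 1 to 10
distortions = sapply(1:10, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
# Plot the elbow curve: the bend suggests a sensible number of clusters
plot(1:10, distortions, type = "b", xlab = "Number of clusters", ylab = "Distortion")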
Remark: for reproducibility purposes, it is important to set the seed to a fixed value with the set.seed() command.
Hierarchical clustering The hierarchical clustering algorithm is an unsupervised technique aimed at forming clusters of observations that look alike based on their distances. It can be performed using the base stats library:
# Dissimilarity matrix
df_distance = dist(data, method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc = hclust(df_distance, method = "complete")
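From there, a sketch of how the result can be inspected, assuming we want to visualize the dendrogram and cut the tree into 4 clusters (the number of clusters is an assumption made for the example):
# Plot the dendrogram of the fitted clustering
plot(hc)
# Cut the tree into 4 clusters and retrieve the cluster of each observation
clusters = cutree(hc, k = 4)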
PCA Principal components analysis, commonly known as PCA, is a dimension reduction technique that can be used as a data preparation step to reduce the number of variables.
It is done using the prcomp function on a matrix x as follows:
model = prcomp(x, params)
where the parameters are summarized below:
| Parameter | Command | Description | Default |
|---|---|---|---|
| Centering | center | Boolean that indicates whether input should be centered around zero | TRUE |
| Scaling | scale | Boolean that indicates whether input should be scaled | FALSE |
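As an illustration, a minimal sketch running PCA on a matrix of numeric features x with centered and scaled inputs (the data x is an assumption made for the example):
# Run PCA with centered and scaled inputs
model = prcomp(x, center = TRUE, scale = TRUE)
# Proportion of variance explained by each principal component
summary(model)
# Coordinates of the observations in the space of the principal components
model$x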
A scree plot is commonly used to check how much of the variance is explained by the newly-created principal components. The table below summarizes the main actions that can be done using the factoextra library: