Modeling
Data preparation with R
R
caret
rpart
randomForest
class
e1071
stats
factoextra
By Afshine Amidi and Shervine Amidi
Overview
There are several steps that are needed to build a machine learning model:
- feature engineering: building features that can be interpreted and that can have a high predictive power
- model selection: choosing a model that can generalize well on unseen data
- tracking performance metrics: quantifying how the model performs and including the uncertainty of predictions in the results
Data preparation
Resampling The following table summarizes the main sampling techniques that are used to correct the distribution of the data, which can be necessary when classes are imbalanced:
| Action | Command | Class acted on |
|---|---|---|
| Downsample | `downSample(x, y)` | majority class |
| Upsample with repetitions | `upSample(x, y)` | minority class |
| Upsample with SMOTE | `SMOTE(form, data)` | minority class |
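As a minimal sketch of the first two commands, assuming a hypothetical data frame with a feature column `x1` and an imbalanced factor `label` (both names invented for illustration):

```r
library(caret)

# Hypothetical imbalanced dataset: 90 "no" rows vs 10 "yes" rows
df <- data.frame(x1 = rnorm(100),
                 label = factor(rep(c("no", "yes"), times = c(90, 10))))

# Downsample: randomly drop majority-class rows so both classes have 10 rows
down <- downSample(x = df[, "x1", drop = FALSE], y = df$label)

# Upsample: repeat minority-class rows so both classes have 90 rows
up <- upSample(x = df[, "x1", drop = FALSE], y = df$label)

# Both return a data frame with the outcome stored in a "Class" column
table(down$Class)
table(up$Class)
```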
Relationship between variables The following command from the GGally
library displays pairwise correlations as well as visual comparisons:

```r
ggpairs(data)
```
In the generated matrix plot, the representation at row $i$ and column $j$:
- gives the correlation coefficient between variables $i$ and $j$ when $i < j$
- plots the probability distribution of variable $i$ when $i = j$
- plots the data in 2D where $x$- and $y$-axes are resp. variables $j$ and $i$ when $i > j$
Remark: this command is useful for variable selection decisions.
Scaling When features of a dataset are on different scales, it is sometimes recommended to standardize them. This operation can be done using the caret
library as follows:
```r
# Create scaling based on reference dataset data_reference
scaling = preProcess(data_reference, method = c('center', 'scale'))
# Apply scaling to dataset data
data_scaled = predict(scaling, data)
```
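As a concrete sketch using the built-in `iris` data (the split into a reference and a target set below is arbitrary, chosen only for illustration):

```r
library(caret)

# Numeric columns of iris, split into a reference set and a target set
data_reference <- iris[1:100, 1:4]
data <- iris[101:150, 1:4]

# Learn the centering/scaling parameters on the reference set only
scaling <- preProcess(data_reference, method = c('center', 'scale'))

# Apply the same transformation to the other set
data_scaled <- predict(scaling, data)
```

Learning the parameters on a reference set and reusing them avoids leaking information from validation or test data into the preprocessing step.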
Data splits The data can be randomly split into proportions $p$ and $1-p$ resp. for training and validation sets using the caret
library.

One can do so with the following commands:
```r
# Create partition
train_partition = createDataPartition(y, p, list = FALSE)
# Split data into training and validation sets
data_train = data[train_partition,]
data_val = data[-train_partition,]
```
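As a worked sketch on the built-in `iris` data, with an arbitrary $p = 0.8$ and a seed added so the split is reproducible:

```r
library(caret)
set.seed(42)  # make the random split reproducible

# Stratified 80/20 split on the outcome variable
y <- iris$Species
train_partition <- createDataPartition(y, p = 0.8, list = FALSE)

data_train <- iris[train_partition, ]
data_val <- iris[-train_partition, ]
```

Note that `createDataPartition` performs a stratified split, so the class proportions of `y` are approximately preserved in both sets.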