Modeling


Data preparation with R

Libraries: caret, rpart, randomForest, class, e1071, stats, factoextra

By Afshine Amidi and Shervine Amidi

Overview

Several steps are needed to build a machine learning model:


Data preparation

Resampling The following table summarizes the main sampling techniques used to correct the class distribution of the data, which can be necessary when classes are imbalanced. A short usage sketch is given after the table.

Action                       Command
Initial situation            (x, y)
Downsample                   downSample(x, y)
Upsample with repetitions    upSample(x, y)
Upsample with SMOTE          SMOTE(form, data)

(Each row is illustrated by the minority and majority classes before and after resampling.)
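
As an illustration, here is a minimal sketch of downsampling and upsampling with caret on a toy imbalanced dataset (the data frame and its columns x1, x2 and y are made up for the example); the SMOTE(form, data) command comes from a separate package, for instance DMwR:

# Toy imbalanced dataset: 90 observations of class 'no', 10 of class 'yes'
library(caret)
data_imbalanced = data.frame(x1 = rnorm(100),
                             x2 = rnorm(100),
                             y = factor(c(rep('no', 90), rep('yes', 10))))

# Separate predictors x from outcome y
x = data_imbalanced[, c('x1', 'x2')]
y = data_imbalanced$y

# Downsample the majority class / upsample the minority class with repetitions
data_down = downSample(x, y)
data_up = upSample(x, y)

# Check the resulting class distributions (outcome column is renamed 'Class' by default)
table(data_down$Class)
table(data_up$Class)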

Relationship between variables The following command from the GGally library displays pairwise correlations as well as visual comparisons between variables:

library(GGally)
ggpairs(data)

The matrix plot generated looks like the following:

[Matrix plot: relationship between variables]

where the representation at row $i$ and column $j$ is, by default and for numerical variables:

- the distribution of variable $i$ when $i = j$
- a scatterplot of variables $j$ and $i$ when $i > j$ (lower triangle)
- the correlation between variables $i$ and $j$ when $i < j$ (upper triangle)

Remark: this command is useful for variable selection decisions.
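
For instance, one could inspect the built-in iris dataset (used here purely for illustration), coloring the panels by its Species column:

# Matrix plot of iris, with points and densities colored by Species
library(GGally)
library(ggplot2)
ggpairs(iris, mapping = aes(color = Species))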


Scaling When features of a dataset are on different scales, it is sometimes recommended to standardize them. This operation can be done using the caret library as follows:

# Load the caret library
library(caret)

# Create scaling based on reference dataset data_reference
scaling = preProcess(data_reference, method = c('center', 'scale'))

# Apply scaling to dataset data
data_scaled = predict(scaling, data)
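
In practice, the scaling parameters are typically learned on the training set only and then applied to both training and validation data, so that the validation set does not influence preprocessing. A concrete sketch on the built-in iris dataset (used here in place of an actual reference set) would look like:

library(caret)

# Learn centering and scaling parameters on the numerical columns of iris
scaling = preProcess(iris[, 1:4], method = c('center', 'scale'))

# Apply the transformation: each column now has mean 0 and standard deviation 1
iris_scaled = predict(scaling, iris[, 1:4])
summary(iris_scaled)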

Data splits The data can be randomly split into proportions $p$ and $1-p$ for the training and validation sets respectively, using the caret library.

[Illustration: partition of the dataset into training and validation sets]

One can do so with the following commands:

# Create partition indices for the training set
train_partition = createDataPartition(y, p = p, list = FALSE)

# Split data into training and validation sets
data_train = data[train_partition,]
data_val = data[-train_partition,]
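
For instance, an 80%/20% split of the built-in iris dataset (used here purely for illustration) can be obtained as follows; createDataPartition samples within the levels of y, so the split is stratified on Species:

library(caret)

# Fix the random seed so that the example split is reproducible
set.seed(42)

# Indices of the training set, stratified on Species (p = 0.8)
train_partition = createDataPartition(iris$Species, p = 0.8, list = FALSE)

# Split into training (about 80% of rows) and validation (about 20% of rows) sets
data_train = iris[train_partition,]
data_val = iris[-train_partition,]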