Modeling
Data preparation with R
R
caret
rpart
randomForest
class
e1071
stats
factoextra
By Afshine Amidi and Shervine Amidi
Overview
There are several steps that are needed to build a machine learning model:
- feature engineering: building features that can be interpreted and that can have a high predictive power
- model selection: choosing a model that can generalize well on unseen data
- tracking performance metrics: quantifying how the model performs and including the uncertainty of predictions in the results
Data preparation
Resampling The following table summarizes the main sampling techniques that are used to correct the distribution of the data, which can be necessary when classes are imbalanced:
| Action | Command | Class acted on |
|---|---|---|
| Downsample | `downSample(x, y)` | majority class |
| Upsample with repetitions | `upSample(x, y)` | minority class |
| Upsample with SMOTE | `SMOTE(form, data)` | minority class |
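As a minimal sketch of the first two commands, assuming a hypothetical data frame with a feature column `x1` and an imbalanced factor `label` (both names invented for illustration):

```r
library(caret)

# Hypothetical imbalanced dataset: 90 "no" rows vs 10 "yes" rows
df <- data.frame(x1 = rnorm(100),
                 label = factor(rep(c("no", "yes"), times = c(90, 10))))

# Downsample: randomly drop majority-class rows so both classes have 10 rows
down <- downSample(x = df[, "x1", drop = FALSE], y = df$label)

# Upsample: repeat minority-class rows so both classes have 90 rows
up <- upSample(x = df[, "x1", drop = FALSE], y = df$label)

# Both return a data frame with the outcome stored in a "Class" column
table(down$Class)
table(up$Class)
```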
Relationship between variables The following command from the GGally
library displays pairwise correlations as well as visual comparisons:

```r
ggpairs(data)
```
In the generated matrix plot, the representation at row $i$ and column $j$:
- gives the correlation coefficient between variables $i$ and $j$ when $i < j$
- plots the probability distribution of variable $i$ when $i = j$
- plots the data in 2D where $x$- and $y$-axes are resp. variables $j$ and $i$ when $i > j$
Remark: this command is useful for variable selection decisions.
Scaling When features of a dataset are on different scales, it is sometimes recommended to standardize them. This operation can be done using the caret
library as follows:
```r
# Create scaling based on reference dataset data_reference
scaling = preProcess(data_reference, method = c('center', 'scale'))
# Apply scaling to dataset data
data_scaled = predict(scaling, data)
```
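As a concrete sketch using the built-in `iris` data (the split into a reference and a target set below is arbitrary, chosen only for illustration):

```r
library(caret)

# Numeric columns of iris, split into a reference set and a target set
data_reference <- iris[1:100, 1:4]
data <- iris[101:150, 1:4]

# Learn the centering/scaling parameters on the reference set only
scaling <- preProcess(data_reference, method = c('center', 'scale'))

# Apply the same transformation to the other set
data_scaled <- predict(scaling, data)
```

Learning the parameters on a reference set and reusing them avoids leaking information from validation or test data into the preprocessing step.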
Data splits The data can be randomly split into proportions $p$ and $1-p$ resp. for training and validation sets using the caret
library.

One can do so with the following commands:
```r
# Create partition
train_partition = createDataPartition(y, p, list = FALSE)
# Split data into training and validation sets
data_train = data[train_partition,]
data_val = data[-train_partition,]
```
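As a worked sketch on the built-in `iris` data, with an arbitrary $p = 0.8$ and a seed added so the split is reproducible:

```r
library(caret)
set.seed(42)  # make the random split reproducible

# Stratified 80/20 split on the outcome variable
y <- iris$Species
train_partition <- createDataPartition(y, p = 0.8, list = FALSE)

data_train <- iris[train_partition, ]
data_val <- iris[-train_partition, ]
```

Note that `createDataPartition` performs a stratified split, so the class proportions of `y` are approximately preserved in both sets.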