# 1 Introduction

## 1.1 Purpose of this vignette

This vignette is a tutorial on preparing a train and a test set using the dataPreparation package.

In this tutorial, the following points will be covered:

• Preparing a training set,
• Applying the same preparation to a testing set,
• Controlling that the train and test sets have the same shape.

Using the dataPreparation package, those steps can be performed in a simple and robust way.

This package is:

• fast (use data.table and exponential search)
• RAM efficient (perform operations by reference and column-wise to avoid copying data)
• stable (most exceptions are handled)
• easy (since those functions are packaged and handle most of the situations)
• verbose (log a lot)
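The examples in this vignette assume the package has been installed and loaded; a minimal setup sketch (the install line assumes CRAN access and only needs to run once):

```r
# Install once from CRAN, then attach the package for this session
# install.packages("dataPreparation")
library(dataPreparation)
```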

## 1.2 Data set

For this tutorial, the UCI adult data set will be used.

The goal with this data set is to predict the income of individuals based on 14 variables.

Let's have a look at the data set:

data("adult")
print(head(adult, n = 4))
#   age    type_employer fnlwgt education education_num            marital
# 1  39        State-gov  77516 Bachelors            13      Never-married
# 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
# 3  38          Private 215646   HS-grad             9           Divorced
# 4  53          Private 234721      11th             7 Married-civ-spouse
#          occupation  relationship  race  sex capital_gain capital_loss
# 1      Adm-clerical Not-in-family White Male         2174            0
# 2   Exec-managerial       Husband White Male            0            0
# 3 Handlers-cleaners Not-in-family White Male            0            0
# 4 Handlers-cleaners       Husband Black Male            0            0
#   hr_per_week       country income
# 1          40 United-States  <=50K
# 2          13 United-States  <=50K
# 3          40 United-States  <=50K
# 4          40 United-States  <=50K

# 2 Preparing data

## 2.1 Splitting train and test

To avoid leaking information from train to test, the train-test split should be performed before (most) data preparation steps.

To simulate a train and a test set, we are going to randomly split this data set into 80% train and 20% test.

# Random sample indexes
set.seed(1)
train_index <- sample(seq_len(nrow(adult)), 0.8 * nrow(adult))
test_index <- setdiff(seq_len(nrow(adult)), train_index)

# Build X_train, y_train, X_test, y_test
X_train <- adult[train_index, setdiff(names(adult), "income")]
y_train <- adult[train_index, "income"]
X_test <- adult[test_index, setdiff(names(adult), "income")]
y_test <- adult[test_index, "income"]

## 2.2 Filter useless variables

The first thing to do, in order to make computation fast, is to filter out useless variables:

• Constant variables
• Variables that are duplicated (for example col1 == col2)
• Variables that are exact bijections (for example col1 = A, B, B, A and col2 = 1, 2, 2, 1)
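To make the last point concrete, here is an illustrative base-R sketch of what an exact bijection between two columns means (this is not the package's implementation, which uses exponential search):

```r
# Two columns are an exact bijection if their values map one-to-one in both directions
col1 <- c("A", "B", "B", "A")
col2 <- c(1, 2, 2, 1)

# Count distinct (col1, col2) pairs; a bijection has exactly one pair per level
n_pairs <- nrow(unique(data.frame(col1, col2)))
is_bijection <- n_pairs == length(unique(col1)) && n_pairs == length(unique(col2))
print(is_bijection)  # TRUE for this toy example
```

Keeping only one column of such a pair loses no information, which is why the package flags the other for dropping.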

Let's identify them:

constant_cols <- which_are_constant(adult)
# [1] "which_are_constant: it took me 0s to identify 0 constant column(s)"
double_cols <- which_are_in_double(adult)
# [1] "which_are_in_double: it took me 0s to identify 0 column(s) to drop."
bijections_cols <- which_are_bijection(adult)
# [1] "which_are_bijection: it took me 0.06s to identify 1 column(s) to drop."

We found only one bijection: variable education_num, which is an index for variable education. Let's drop it:

X_train$education_num <- NULL
X_test$education_num <- NULL

## 2.3 Scaling

Most machine learning algorithms handle scaled data better than unscaled data.

To perform scaling (meaning setting the mean to 0 and the standard deviation to 1), the function fast_scale is available.

Since it is highly recommended to apply the same scaling on train and test, you should first compute the scales using the function build_scales:

scales <- build_scales(data_set = X_train, cols = c("capital_gain", "capital_loss"), verbose = TRUE)
# [1] "build_scales: I will compute scale on  2 numeric columns."
# [1] "build_scales: it took me: 0s to compute scale for 2 numeric columns."
print(scales)
# $capital_gain
# $capital_gain$mean
# [1] 1085.825
#
# $capital_gain$sd
# [1] 7428.122
#
#
# $capital_loss
# $capital_loss$mean
# [1] 85.09924
#
# $capital_loss$sd
# [1] 398.067

As one can see, those two columns have very different means and standard deviations. Let's apply scaling to them:

X_train <- fast_scale(data_set = X_train, scales = scales, verbose = TRUE)
# [1] "fast_scale: I will scale 2 numeric columns."
# [1] "fast_scale: it took me: 0s to scale 2 numeric columns."
X_test <- fast_scale(data_set = X_test, scales = scales, verbose = TRUE)
# [1] "fast_scale: I will scale 2 numeric columns."
# [1] "fast_scale: it took me: 0s to scale 2 numeric columns."

And now let's have a look at the result:

print(head(X_train[, c("capital_gain", "capital_loss")]))
#    capital_gain capital_loss
# 1:    0.4009324   -0.2137812
# 2:   -0.1461776    4.5643086
# 3:   -0.1461776    3.7152054
# 4:   -0.1461776   -0.2137812
# 5:    0.8363049   -0.2137812
# 6:   -0.1461776   -0.2137812
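To make the arithmetic behind these numbers concrete, here is a base-R toy sketch of the standardization formula, (x - mean) / sd, on made-up data (not the adult set):

```r
# Standardize a toy vector: subtract the mean, divide by the standard deviation
x <- c(0, 2174, 0, 0, 14084)
scaled <- (x - mean(x)) / sd(x)

# By construction, the scaled vector has mean ~0 and standard deviation 1
print(round(mean(scaled), 10))
print(round(sd(scaled), 10))
```

This is why the scaled capital_gain and capital_loss columns above cluster around 0, with large raw values mapped to large positive z-scores.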

## 2.4 Discretization

One might want to discretize the variable age, either using an equal-freq/equal-width method, or some hand-written bins.

To compute equal-freq bins, the function build_bins is available:

bins <- build_bins(data_set = X_train, cols = "age", n_bins = 10, type = "equal_freq")
# [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_freq method."
# [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns."
print(bins)
# $age
# [1] -Inf   22   26   30   33   37   41   45   51   58  Inf

To make it easy to use, in this package:

• data_set will always denote the data.table on which you want to perform something.
• cols will always denote the columns on which you want to apply the function. It can also be set to "auto" to apply the function to all relevant columns.
• Some specific arguments may be needed; they are presented in the documentation of each function.

Let's apply our own bins:

X_train <- fast_discretization(data_set = X_train, bins = list(age = c(0, 18, 25, 45, 62, +Inf)))
# [1] "fast_discretization: I will discretize 1 numeric columns using, bins."
# [1] "fast_discretization: it took me: 0.11s to transform 1 numeric columns into, binarised columns."
X_test <- fast_discretization(data_set = X_test, bins = list(age = c(0, 18, 25, 45, 62, +Inf)))
# [1] "fast_discretization: I will discretize 1 numeric columns using, bins."
# [1] "fast_discretization: it took me: 0.03s to transform 1 numeric columns into, binarised columns."

Here bins have been defined to compute the following groups:

• 0 to 18
• 18 to 25
• 25 to 45
• 45 to 62
• Over 62

Let's control it:

print(table(X_train$age))
#
#    [0, 18[   [18, 25[   [25, 45[   [45, 62[ [62, +Inf[
#        319       4156      13264       6645       1664
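The same left-closed grouping can be sketched with base R's cut() on a few toy ages (this is only an analogy to illustrate the interval labels above, not what fast_discretization does internally):

```r
# Discretize a few ages with the hand-written breaks; right = FALSE gives [a, b) intervals
ages <- c(17, 20, 30, 50, 70)
groups <- cut(ages, breaks = c(0, 18, 25, 45, 62, Inf), right = FALSE)
print(groups)
```

Each age falls into exactly one interval, matching the `[0, 18[`, `[18, 25[`, ... labels in the table above.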

## 2.5 Encoding categorical

One thing to do when you are using a machine learning algorithm such as a logistic regression or a neural network is to encode factor variables. One way to do that is to perform one-hot encoding. For example:

| ID | var |
|----|-----|
| 1  | A   |
| 2  | B   |
| 3  | C   |
| 4  | C   |

Would become:

| ID | var.A | var.B | var.C |
|----|-------|-------|-------|
| 1  | 1     | 0     | 0     |
| 2  | 0     | 1     | 0     |
| 3  | 0     | 0     | 1     |
| 4  | 0     | 0     | 1     |
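The transformation above can be reproduced on the toy table with base R's model.matrix (shown only to illustrate the idea; one_hot_encoder is the package's way to do it at scale, with consistent columns across train and test):

```r
# One-hot encode the toy "var" column; "- 1" removes the intercept so every level gets a column
toy <- data.frame(ID = 1:4, var = c("A", "B", "C", "C"))
one_hot <- model.matrix(~ var - 1, data = toy)
print(one_hot)
```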

To perform it, one can use dataPreparation::one_hot_encoder, which uses data.table's power to do it in a fast and RAM-efficient way. Since it is important to have the same columns in train and test, one first computes the encoding:

encoding <- build_encoding(data_set = X_train, cols = "auto", verbose = TRUE)
# [1] "fnlwgt"       "capital_gain" "capital_loss" "hr_per_week"
# [1] "build_encoding: c(\"fnlwgt\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor or character i do nothing for those variables."
# [1] "build_encoding: I will compute encoding on 9 character and factor columns."
# [1] "build_encoding: it took me: 0s to compute encoding for 9 character and factor columns."

The argument cols = "auto" means that build_encoding will automatically select all columns that are either character or factor to prepare encoding.

And then one can apply them to both tables:

X_train <- one_hot_encoder(data_set = X_train, encoding = encoding, drop = TRUE, verbose = TRUE)
# [1] "one_hot_encoder: I will one hot encode some columns."
# [1] "one_hot_encoder: I am doing column: age"
# [1] "one_hot_encoder: I am doing column: type_employer"
# [1] "one_hot_encoder: I am doing column: education"
# [1] "one_hot_encoder: I am doing column: marital"
# [1] "one_hot_encoder: I am doing column: occupation"
# [1] "one_hot_encoder: I am doing column: relationship"
# [1] "one_hot_encoder: I am doing column: race"
# [1] "one_hot_encoder: I am doing column: sex"
# [1] "one_hot_encoder: I am doing column: country"
# [1] "one_hot_encoder: It took me 0s to transform 9 column(s)."
X_test <- one_hot_encoder(data_set = X_test, encoding = encoding, drop = TRUE, verbose = TRUE)
# [1] "one_hot_encoder: I will one hot encode some columns."
# [1] "one_hot_encoder: I am doing column: age"
# [1] "one_hot_encoder: I am doing column: type_employer"
# [1] "one_hot_encoder: I am doing column: education"
# [1] "one_hot_encoder: I am doing column: marital"
# [1] "one_hot_encoder: I am doing column: occupation"
# [1] "one_hot_encoder: I am doing column: relationship"
# [1] "one_hot_encoder: I am doing column: race"
# [1] "one_hot_encoder: I am doing column: sex"
# [1] "one_hot_encoder: I am doing column: country"
# [1] "one_hot_encoder: It took me 0s to transform 9 column(s)."

This function is called the following way:

• data_set = X_train: means that it will perform transformation on X_train
• encoding = encoding: means that we use previously built encoding
• drop = TRUE: means that it will drop original columns
• verbose = TRUE: means that it will log to tell you what it is doing.

Even if it is not kept in the log, a progress bar is displayed so you can see whether the function is running and how fast. This progress bar is available in most functions from this package. It can be really helpful when you are handling very large data sets.

Let's check the dimensions of X_train and X_test:

print("Dimensions of X_train: ")
# [1] "Dimensions of X_train: "
print(dim(X_train))
# [1] 26048   111
print("Dimensions of X_test: ")
# [1] "Dimensions of X_test: "
print(dim(X_test))
# [1] 6513  111

## 2.6 Filtering variables

Since a lot of columns have been created, another filtering pass could be relevant:

bijections <- which_are_bijection(data_set = X_train, verbose = TRUE)
# [1] "which_are_bijection: it took me 6.22s to identify 1 column(s) to drop."
print(names(X_train)[bijections])
# [1] "sex.Male"

Thanks to optimisations (such as exponential search), it takes only around 7 seconds to compare 111 columns pairwise and identify bijections. Without those optimisations, it could take minutes.

which_are_bijection found that column sex.Male is a bijection of column sex.Female.

Let's drop one of them:

# Capture the column names first, since set() drops them from X_train by reference
bijection_cols <- names(X_train)[bijections]
set(X_train, NULL, bijection_cols, NULL)
set(X_test, NULL, bijection_cols, NULL)

# 3 Controling shape

Last but not least, it is very important to make sure that train and test sets have the same shape (for example the same columns).

To make sure of that, one can use the following function:

X_test <- same_shape(X_test, reference_set = X_train, verbose = TRUE)
# [1] "same_shape: verify that every column is present."
# [1] "same_shape: drop unwanted columns."
# [1] "same_shape: verify that every column is in the right type."
# [1] "same_shape: verify that every factor as the right number of levels."

No warning has been raised: everything is OK.

# 4 Conclusion

We presented some of the functions of the dataPreparation package. A few more are available, and they have parameters to make them easier to use. So if you liked it, please check the package documentation (by installing the package or on CRAN).

We hope that this package is helpful and that it helped you prepare your data faster.

If you would like to give us feedback, report issues, or request features for this package, please tell us on GitHub. Also, if you want to contribute, please don't hesitate to contact us.