Abstract
In this vignette, we go through creating balanced partitions for train/test sets and balanced folds for cross-validation with the R package groupdata2. We write a simple cross-validation function and use it on some linear regression models.
groupdata2 is a set of methods for easy grouping, windowing, folding, partitioning and splitting of data.
For a more extensive description of groupdata2, please see Description of groupdata2.
Contact the author at [email protected]
In this vignette, we will train a couple of linear regression models on some data. We will use cross-validation with balanced folds to find the best model, and then test that model on a subset of the original data that we haven't used for training the model.
Our first task will be to split the dataset into a training set and a test set using partition(). Then, we will create folds using the fold() function. We will code up a simple cross-validation function and put some models to the test!
While we will be writing our own code for cross-validation, it is recommended to use the cvms package for this functionality when working with linear or logistic regression models.
partition() creates (optionally) balanced partitions (e.g. train/test sets) from given group sizes. It can balance partitions on one categorical variable (e.g. diagnosis) and/or one numerical variable (e.g. age). It is also able to keep all datapoints with a shared ID (e.g. participant_id) in the same partition.
fold() creates (optionally) balanced folds for cross-validation. It can balance folds on one categorical variable (e.g. diagnosis) and/or one numerical variable (e.g. age). It is also able to keep all datapoints with a shared ID (e.g. participant_id) in the same fold. If we want to run repeated cross-validation, it can create multiple unique fold columns at once.
The essence of cross-validation is to test a model against data that it hasn't been trained on, i.e. estimating out-of-sample error. It is done by first dividing the data into groups called folds.
Say we choose to divide the data into 5 folds. Then, in the first iteration, we train a model on the first four folds and test it on the fifth fold. In the second iteration, we then train on folds 2, 3, 4 and 5 and test on fold 1. We continue changing which fold is the test fold until all folds have been test folds (i.e. we train and test 5 times in total). In the end, we get the average performance of the models and compare these to other cross-validated models.
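The rotation described above can be sketched with a simple base R loop (an illustration only; the full cross-validation function we write later in this vignette does the actual subsetting and model fitting):

```r
# Illustration of the fold rotation with k = 5 folds
k <- 5
for (test_fold in 1:k){
  # The remaining folds form the training set in this iteration
  train_folds <- setdiff(1:k, test_fold)
  cat("Iteration", test_fold,
      "- train on folds:", paste(train_folds, collapse = ", "),
      "- test on fold:", test_fold, "\n")
}
```

Each fold is the test fold exactly once, and the other k - 1 folds are used for training.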
The model with the lowest average error is believed to be the best at predicting unseen data from the same population(s) and thus chosen for further interpretation / use.
This is a great technique, and fold() makes it even more powerful.
Even with cross-validation, we have a risk of overfitting our models to our data. That is, while we get really good predictions for our current data, it won't generalize to new data that we might gather in the future.
To test if this is the case, we store some of the data and don't touch it before we feel confident that we have found the best model. Then we use the model to predict the targets / dependent variable in that data. If the results are much worse on the test set, it likely means that our model is suffering from overfitting! Then we might go back and adjust the models, until we're pretty confident again. We might also change the number of folds or run repeated cross-validation instead.
Generally though, we should use the test set as few times as possible, as running this cycle too many times could accidentally find a model that is a good predictor of the specific datapoints in the test set, but not in general.
If the test set results are instead somewhat similar to the cross-validation results, these are the results that we report (possibly along with the cross-validation results).
Let's say we have scored 10 participants, each with one of two diagnoses, 'a' and 'b', on a very interesting task that you are free to call 'the task'.
# Attach some packages
library(groupdata2)
library(dplyr)
library(ggplot2)
library(knitr) # kable()
library(lmerTest) #lmer()
library(broom) #tidy()
library(hydroGOF) # rmse()
# Create data frame
df <- data.frame(
  "participant" = factor(as.integer(
    rep(c('1', '2', '3', '4', '5',
          '6', '7', '8', '9', '10'), 3))),
  "age" = rep(c(20, 23, 27, 21, 32, 31, 43, 21, 34, 32), 3),
  "diagnosis" = factor(rep(c('a', 'b', 'a', 'b', 'b',
                             'a', 'a', 'a', 'b', 'b'), 3)),
  "score" = c(10, 24, 15, 35, 24, 14, 11, 16, 33, 29,  # for 1st session
              24, 40, 30, 50, 54, 25, 35, 32, 53, 55,  # for 2nd session
              45, 67, 40, 78, 62, 30, 41, 44, 66, 81)) # for 3rd session
# Order by participant
df <- df %>% arrange(participant)
# Remove index
rownames(df) <- NULL
# Add session info
df$session <- as.integer(rep(c('1', '2', '3'), 10))
# Show the data frame
kable(df, align = 'c')
|participant | age | diagnosis | score | session |
|:----------:|:---:|:---------:|:-----:|:-------:|
|     1      | 20  |     a     |  10   |    1    |
|     1      | 20  |     a     |  24   |    2    |
|     1      | 20  |     a     |  45   |    3    |
|     2      | 23  |     b     |  24   |    1    |
|     2      | 23  |     b     |  40   |    2    |
|     2      | 23  |     b     |  67   |    3    |
|     3      | 27  |     a     |  15   |    1    |
|     3      | 27  |     a     |  30   |    2    |
|     3      | 27  |     a     |  40   |    3    |
|     4      | 21  |     b     |  35   |    1    |
|     4      | 21  |     b     |  50   |    2    |
|     4      | 21  |     b     |  78   |    3    |
|     5      | 32  |     b     |  24   |    1    |
|     5      | 32  |     b     |  54   |    2    |
|     5      | 32  |     b     |  62   |    3    |
|     6      | 31  |     a     |  14   |    1    |
|     6      | 31  |     a     |  25   |    2    |
|     6      | 31  |     a     |  30   |    3    |
|     7      | 43  |     a     |  11   |    1    |
|     7      | 43  |     a     |  35   |    2    |
|     7      | 43  |     a     |  41   |    3    |
|     8      | 21  |     a     |  16   |    1    |
|     8      | 21  |     a     |  32   |    2    |
|     8      | 21  |     a     |  44   |    3    |
|     9      | 34  |     b     |  33   |    1    |
|     9      | 34  |     b     |  53   |    2    |
|     9      | 34  |     b     |  66   |    3    |
|     10     | 32  |     b     |  29   |    1    |
|     10     | 32  |     b     |  55   |    2    |
|     10     | 32  |     b     |  81   |    3    |
As we can see, there are 5 participants for each diagnosis, and they all seem to have gotten better at the task throughout the three sessions they went through.
In this step, we will split our data into two subsets called train_set and test_set. The main point is that we must avoid leakage between the two sets; otherwise the test set is of no use to us.
If we were to split the data randomly, with 80 percent going to the training set and 20 percent going to the test set, we would likely have some of the participants in both the training set AND the test set. But our motivation for having a test set is that we want to know how well our model predicts new, future participants, not how well it knows the ones it saw during training. Furthermore, if our model is overfitted to those participants, our test set might not warn us of this. With this approach, we could get a really low error on both the test set and the training set, even though our model is useless outside of the data we're working with.
So, how do we deal with this? In this case, it's as simple as making sure each participant is only in one of the data sets. We can do this with partition() and the id_col argument.
xpectr::set_test_seed(1) # For reproducibility
# Split data in 20/80 (percentage)
partition(df, p = 0.2, id_col = "participant") %>%
  .[[1]] %>%  # See only the test set
  kable()     # Pretty tables :)

The above shows the test set. It contains 2 participants (10 and 5). They both have all 3 sessions in this subset, meaning they are not in the training set! Perfect!
But ehmm.. if we look at the diagnosis column, they both have the diagnosis 'b'. If we would like to know how well our future model classifies both of the diagnoses, this won't do. It would be nicer to have roughly the same balance of the two diagnoses in both the training set and the test set. Luckily, that is what the cat_col argument is for.
More specifically (behind the scenes), it first subsets the full dataset by each class in the categorical column. Then, it creates partitions within each subset and merges the partitions at the end. When using both the id_col and cat_col arguments, it first subsets by class and then partitions by the unique values in id_col. This means that the final partition sizes might not always be exactly as specified, depending on the data. Therefore, it is good practice to look at the distribution of participants, classes, etc. in the final partitions, which we will do after trying out that cat_col argument to get our final test and training sets!
xpectr::set_test_seed(1) # For reproducibility
# Split data in 20/80 (percentage)
parts <- partition(df, p = 0.2, id_col = "participant", cat_col = 'diagnosis')
test_set <- parts[[1]]
train_set <- parts[[2]]
# Show test_set
test_set %>% kable()
|participant | age | diagnosis | score | session |
|:----------:|:---:|:---------:|:-----:|:-------:|
|     8      | 21  |     a     |  16   |    1    |
|     8      | 21  |     a     |  32   |    2    |
|     8      | 21  |     a     |  44   |    3    |
|     10     | 32  |     b     |  29   |    1    |
|     10     | 32  |     b     |  55   |    2    |
|     10     | 32  |     b     |  81   |    3    |
Now, our test set contains 2 participants (8 and 10), and the diagnosis column contains 50 percent a’s and 50 percent b’s. Let’s count the diagnoses in the training set as well.
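A count like the one below can be produced with dplyr's count() (a sketch, assuming the train_set created in the partitioning step above):

```r
# Count the rows per diagnosis in the training set
train_set %>%
  count(diagnosis) %>%
  kable(align = 'c')
```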
|diagnosis | n  |
|:--------:|:--:|
|    a     | 12 |
|    b     | 12 |
There are 12 rows for each diagnosis in the training set. In this case, we know that the rest of the participants are all “fully” in this dataset. But let’s count them anyway, so we will remember to do it in the future.
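The same kind of check works for participants (again assuming the train_set from above); each remaining participant should appear with all 3 of their sessions:

```r
# Count the rows per participant in the training set
# Each participant should have 3 rows (one per session)
train_set %>%
  count(participant) %>%
  kable(align = 'c')
```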
|participant | n |
|:----------:|:-:|
|     1      | 3 |
|     2      | 3 |
|     3      | 3 |
|     4      | 3 |
|     5      | 3 |
|     6      | 3 |
|     7      | 3 |
|     9      | 3 |
In this section, we will create balanced folds for cross-validation. The thought process resembles that of the previous section. fold() basically just creates a number of similarly sized partitions. Where partition() returned a list of data frames (this is optional, btw.), fold() returns the entire training set with a new column called ".folds". This will be used directly in the cross-validation function to subset the data on each iteration. Remember that the folds take shifts at being the cross-validation test set (not to be confused with the test set we created in the previous step).
As we’re basically recreating the training / test set scenario that we discussed previously, we want to avoid leakage between the folds. We would also like somewhat balanced distributions of the diagnoses between the folds, though this depends on the context. Let’s create our balanced folds:
xpectr::set_test_seed(1) # For reproducibility
train_set <- fold(train_set, k = 4, cat_col = 'diagnosis', id_col = 'participant')
# Order by .folds
train_set <- train_set %>% arrange(.folds)
train_set %>% kable()
|participant | age | diagnosis | score | session | .folds |
|:----------:|:---:|:---------:|:-----:|:-------:|:------:|
|     7      | 43  |     a     |  11   |    1    |   1    |
|     7      | 43  |     a     |  35   |    2    |   1    |
|     7      | 43  |     a     |  41   |    3    |   1    |
|     2      | 23  |     b     |  24   |    1    |   1    |
|     2      | 23  |     b     |  40   |    2    |   1    |
|     2      | 23  |     b     |  67   |    3    |   1    |
|     1      | 20  |     a     |  10   |    1    |   2    |
|     1      | 20  |     a     |  24   |    2    |   2    |
|     1      | 20  |     a     |  45   |    3    |   2    |
|     5      | 32  |     b     |  24   |    1    |   2    |
|     5      | 32  |     b     |  54   |    2    |   2    |
|     5      | 32  |     b     |  62   |    3    |   2    |
|     6      | 31  |     a     |  14   |    1    |   3    |
|     6      | 31  |     a     |  25   |    2    |   3    |
|     6      | 31  |     a     |  30   |    3    |   3    |
|     4      | 21  |     b     |  35   |    1    |   3    |
|     4      | 21  |     b     |  50   |    2    |   3    |
|     4      | 21  |     b     |  78   |    3    |   3    |
|     3      | 27  |     a     |  15   |    1    |   4    |
|     3      | 27  |     a     |  30   |    2    |   4    |
|     3      | 27  |     a     |  40   |    3    |   4    |
|     9      | 34  |     b     |  33   |    1    |   4    |
|     9      | 34  |     b     |  53   |    2    |   4    |
|     9      | 34  |     b     |  66   |    3    |   4    |
The training set now contains the ".folds" column. As fold() groups the output data frame by this column, we can count the number of rows for each diagnosis and participant per fold like so:
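A sketch of that count with dplyr (because fold() returns the data frame grouped by ".folds", count() includes the fold column automatically; this assumes the train_set from above):

```r
# Rows per participant and diagnosis within each fold
# The grouping by ".folds" is carried over from fold()
train_set %>%
  count(participant, diagnosis) %>%
  kable(align = 'c')
```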
|.folds | participant | diagnosis | n |
|:-----:|:-----------:|:---------:|:-:|
|   1   |      2      |     b     | 3 |
|   1   |      7      |     a     | 3 |
|   2   |      1      |     a     | 3 |
|   2   |      5      |     b     | 3 |
|   3   |      4      |     b     | 3 |
|   3   |      6      |     a     | 3 |
|   4   |      3      |     a     | 3 |
|   4   |      9      |     b     | 3 |
Fold 1 contains participants 2 and 7 with 3 rows each. Participant 2 has the diagnosis 'b' and participant 7 has the diagnosis 'a'. This pattern is the same for all folds. This dataset was created for the purpose of showing these functions, so we should expect real-world data to be less perfectly distributed, and remember to run some checks of the created folds and partitions. Also be aware that the maximum number of folds is restricted by the number of participants in the classes when using the id_col and cat_col arguments together. Remember that the function first subsets the dataset by cat_col and then creates groups (folds) from the unique values in id_col in each subset, before merging the subsets. So if you only have 3 participants in one of the classes, 3 will be the maximum number of folds you can create.
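We can check this limit directly by counting the unique participants per diagnosis (a quick sanity check, assuming the train_set from above; ungroup() drops the ".folds" grouping first):

```r
# Unique participants per diagnosis - the smallest count
# is the maximum number of folds we can create
train_set %>%
  dplyr::ungroup() %>%
  dplyr::distinct(participant, diagnosis) %>%
  dplyr::count(diagnosis)
```

In our training set, each diagnosis has 4 participants, so k = 4 is the maximum here.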
What's left, you ask? Well, now we get to have fun, training and comparing models using cross-validation! If you just came for partition() and fold(), you can skip this part. Personally, I will move on and walk you (you won't let me walk alone, will you? Pleaase?) through a simple cross-validation function for testing a set of models. The models will be predicting the scores of the participants.
We have 4 folds in our training set, so we want to train our models on 3 of the folds and test on the last fold. All the folds should be used as the test fold once. While there are often faster alternatives to a for loop, we will use one to illustrate the process. We will create a simple cross-validation function where we can specify the model to test and whether it has random effects. The performance of the models will be measured with RMSE (Root Mean Square Error).
crossvalidate <- function(data, k, model, dependent, random = FALSE){
  # 'data' is the training set with the ".folds" column
  # 'k' is the number of folds we have
  # 'model' is a string describing a linear regression model formula
  # 'dependent' is a string with the name of the score column we want to predict
  # 'random' is a logical (TRUE/FALSE); do we have random effects in the model?

  # Initialize empty vector for recording performances
  performances <- c()

  # One iteration per fold
  for (fold in 1:k){

    # Create training set for this iteration
    # Subset all the datapoints where .folds does not match the current fold
    training_set <- data[data$.folds != fold,]

    # Create test set for this iteration
    # Subset all the datapoints where .folds matches the current fold
    testing_set <- data[data$.folds == fold,]

    ## Train model

    # If there is a random effect,
    # use lmer() to train model
    # else use lm()
    # Note: we store the fit in 'fitted_model' so we don't
    # overwrite the 'model' formula string between iterations
    if (isTRUE(random)){
      # Train linear mixed effects model on training set
      fitted_model <- lmer(model, training_set, REML = FALSE)
    } else {
      # Train linear model on training set
      fitted_model <- lm(model, training_set)
    }

    ## Test model

    # Predict the dependent variable in the testing_set with the trained model
    predicted <- predict(fitted_model, testing_set, allow.new.levels = TRUE)

    # Get the Root Mean Square Error between the predicted and the observed
    RMSE <- rmse(predicted, testing_set[[dependent]])

    # Add the RMSE to the performance vector
    performances[fold] <- RMSE
  }

  # Return the mean of the recorded RMSEs
  return(c('RMSE' = mean(performances)))
}
We could have a hypothesis that people with the diagnosis 'b' tend to be better at the experiment. This could be tested with a simple linear model.
lm(score ~ diagnosis, df) %>%
tidy()
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 27.5 4.07 6.74 0.000000253
#> 2 diagnosisb 22.6 5.76 3.92 0.000515
The linear model supports the hypothesis, as the average score of participants with diagnosis 'b' is larger than that of participants with diagnosis 'a'. As this difference is statistically significant, we can reasonably reject the null hypothesis that there is no difference between the two diagnoses.
To improve our model, we might want to include the information we have about age and session. Perhaps the older participants do better than the younger? And perhaps participants with the diagnosis 'b' are better at learning over time (session)? By including such information in our model, we might be better at predicting the score. We could also use participant as a random effect, to factor out the personal differences.
Let's list a bunch of possible model formulas that we can then compare with cross-validation. The cross-validation function needs the model to be passed in the format below (a string). Instead of looking at summaries for each model, we will find the best model with cross-validation and only look at the summary for that one. Notice that when we want to compare models, we want to keep the same random effects for all the models, so we are only comparing the combination of fixed effects.
m0 <- 'score ~ 1 + (1 | participant)'
m1 <- 'score ~ diagnosis + (1 | participant)'
m2 <- 'score ~ diagnosis + age + (1 | participant)'
m3 <- 'score ~ diagnosis + session + (1 | participant)'
m4 <- 'score ~ diagnosis * session + (1 | participant)'
m5 <- 'score ~ diagnosis * session + age + (1 | participant)'
Now, let's cross-validate the 6 models. If you get a message saying "boundary (singular) fit", it's because the model is too complex for our small dataset. To avoid this message, try removing the random effects from the models.
m0
#> [1] "score ~ 1 + (1 | participant)"
crossvalidate(train_set, k = 4, model = m0, dependent = 'score', random = TRUE)
#>     RMSE
#> 18.27222
m1
#> [1] "score ~ diagnosis + (1 | participant)"
crossvalidate(train_set, k = 4, model = m1, dependent = 'score', random = TRUE)
#>    RMSE
#> 14.7661
m2
#> [1] "score ~ diagnosis + age + (1 | participant)"
crossvalidate(train_set, k = 4, model = m2, dependent = 'score', random = TRUE)
#>     RMSE
#> 15.28533
m3
#> [1] "score ~ diagnosis + session + (1 | participant)"
crossvalidate(train_set, k = 4, model = m3, dependent = 'score', random = TRUE)
#>     RMSE
#> 6.176477
m4
#> [1] "score ~ diagnosis * session + (1 | participant)"
crossvalidate(train_set, k = 4, model = m4, dependent = 'score', random = TRUE)
#>     RMSE
#> 5.950014
m5
#> [1] "score ~ diagnosis * session + age + (1 | participant)"
crossvalidate(train_set, k = 4, model = m5, dependent = 'score', random = TRUE)
#>     RMSE
#> 7.004665
The model m4 has the lowest error on average, so we assume that it is the best predictor of out-of-sample data. To make sure it doesn't overfit the data, let's look at the error on the test set.
# Training the model on the full training set
model_m4 <- lmer(m4, train_set, REML = FALSE)
# Predict the dependent variable in the test set with the trained model
predicted < predict(model_m4, test_set, allow.new.levels = TRUE)
# Get the Root Mean Square Error between the predicted and the observed
RMSE <- rmse(predicted, test_set[['score']])
RMSE
#> [1] 6.418155
6.42 is a bit higher than the cross-validated average, but it's so close that we're not afraid of overfitting. Be aware that the scale of the RMSE depends on the scale of the dependent variable in our dataset, so you might find some models with much higher RMSEs than this one. What matters is the relative difference between models: the best model is the one with the lowest error.
Let’s look at its summary:
model_m4 %>%
summary()
#> Linear mixed model fit by maximum likelihood. t-tests use Satterthwaite's
#>   method [lmerModLmerTest]
#> Formula: m4
#>    Data: train_set
#>
#>      AIC      BIC   logLik deviance df.resid
#>    157.0    164.1    -72.5    145.0       18
#>
#> Scaled residuals:
#>     Min      1Q  Median      3Q     Max
#> -1.8723 -0.6998  0.0137  0.7042  1.6556
#>
#> Random effects:
#>  Groups      Name        Variance Std.Dev.
#>  participant (Intercept)  3.714   1.927
#>  Residual                21.398   4.626
#> Number of obs: 24, groups:  participant, 8
#>
#> Fixed effects:
#>                    Estimate Std. Error      df t value Pr(>|t|)
#> (Intercept)          0.1667     3.6621 22.2764   0.046   0.9641
#> diagnosisb           9.4167     5.1790 22.2764   1.818   0.0825 .
#> session             13.2500     1.6355 16.0000   8.102 4.71e-07 ***
#> diagnosisb:session   6.3750     2.3129 16.0000   2.756   0.0141 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation of Fixed Effects:
#>             (Intr) dgnssb sessin
#> diagnosisb  -0.707
#> session     -0.893  0.632
#> dgnssb:sssn  0.632 -0.893 -0.707
In this model, we have a significant interaction between diagnosis and session. The interpretation of this result would be quite different from that of the first model we tried.
Notice that the cross-validated m3 model is so close to the cross-validated m4 model that we can't be completely sure which is better. The observed differences might as well stem from the randomness in the partition() or fold() steps. In such a case, we could turn to repeated cross-validation, where we create multiple fold columns (by setting num_fold_cols in fold()) and average their cross-validated results. If the models are still very close, we might consider reporting all three models, or at least checking if the interpretations of the three models differ. These things are highly dependent on the context.
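A sketch of creating multiple fold columns for repeated cross-validation with the num_fold_cols argument mentioned above (the exact names of the added columns are an assumption here; check the fold() documentation):

```r
xpectr::set_test_seed(1)
# Create 3 unique fold columns instead of one
train_set <- fold(train_set, k = 4, cat_col = 'diagnosis',
                  id_col = 'participant', num_fold_cols = 3)
# We would then cross-validate once per fold column
# (e.g. ".folds_1", ".folds_2", ".folds_3") and average the results
```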
When working with linear and logistic regression, the cvms package contains functions for (repeated) cross-validation and is specifically designed to work well with groupdata2.
Well done, you made it to the end of this introduction to groupdata2! If you want to know more about the various methods and arguments, you can read the Description of groupdata2.
If you have any questions or comments about this vignette (tutorial) or groupdata2, please send them to me at [email protected], or open an issue on the GitHub page https://github.com/LudvigOlsen/groupdata2 so I can make improvements.