r machine-learning cross-validation r-caret

caret: combine the stratified createMultiFolds (repeatedCV) and groupKFold

My question is very similar to the one asked in caret: combine createResample and groupKFold

The only difference: I need to create stratified folds (also repeated 10 times) after grouping instead of bootstrapped resamples (which are not stratified as far as I know) for using it with caret's trainControl. The following code is working with 10-fold repeated CV but I couldn't include the grouping of the data based on an "ID" (df$ID).

# creating indices
cv.10.folds <- createMultiFolds(rf_label, k = 10, times = 10)
# creating folds    
ctrl.10fold <- trainControl(method = "repeatedcv", number = 10, repeats = 10, index = cv.10.folds)
# train
rf.ctrl10 <- train(rf_train, y = rf_label, method = "rf", tuneLength = 6,
                       ntree = 1000, trControl = ctrl.10fold, importance = TRUE)

That's my actual problem: My data contains many groups composed of 20 instances each, having the same "ID". So, when using the 10-fold CV repeated 10 times I get some instances of a group in the training and some in the validation set. This I want to avoid, but overall I need a stratified partitioning for the prediction value (df$Label). (All instances having the same "ID" also have the same prediction/label value.)

In the provided and accepted answer from the link above (see parts below) I guess I have to modify the folds2 line to contain the stratified 10-fold CV instead of the bootstrapped

folds <- groupKFold(x)
folds2 <- lapply(folds, function(x) lapply(1:10, function(i) sample(x, size = length(x), replace = TRUE)))

but unfortunately I cannot figure out how exactly. Could you help me with that?

Solution

Here is an approach to perform stratified repeated K-fold CV with blocking.

library(caret)
library(tidyverse)

some fake data where id will be the blocking factor:

id <- sample(1:55, size = 1000, replace = T)
y <- rnorm(1000)
x <- matrix(rnorm(10000), ncol = 10)
df <- data.frame(id, y, x)

summarise the observations by the blocking factor:

df %>%
  group_by(id) %>%
  summarise(mean = mean(y)) %>%
  ungroup() -> groups1

create the stratified folds based on the grouped data:

folds <- createMultiFolds(groups1$mean, 10, 3)

back join the original df to the group data and take the df row id's

folds <- lapply(folds, function(i){
  data.frame(id = i) %>%
    left_join(df %>%
                rowid_to_column()) %>%
    pull(rowid) 
})

check if the data id's in the test are not in the train:

lapply(folds, function(i){
  sum(df[i,1] %in% df[-i,1])
})

output is a bunch of zeros, meaning no id's in the test folds are in the train folds.

If your group id's are not numeric there are two approaches to make this work:
1 convert them to numeric:

first some data

id <- sample(1:55, size = 1000, replace = T)
y <- rnorm(1000)
x <- matrix(rnorm(10000), ncol = 10)
df <- data.frame(id = paste0("id_", id), y, x) #factor id's

df %>%
  mutate(id = as.numeric(id)) %>% #convert to numeric
  group_by(id) %>%
  summarise(mean = mean(y)) %>%
  ungroup() -> groups1 

folds <- createMultiFolds(groups1$mean, 10, 3)

folds <- lapply(folds, function(i){
  data.frame(id = i) %>%
    left_join(df %>%
                mutate(id = as.numeric(id)) %>% #also need to convert to numeric in the original data frame
                rowid_to_column()) %>%
    pull(rowid) 
})

2 filter the id's in grouped data according to fold indexes and then join by id's

df %>%
  group_by(id) %>%
  summarise(mean = mean(y)) %>%
  ungroup() -> groups1 

folds <- createMultiFolds(groups1$mean, 10, 3)

folds <- lapply(folds, function(i){
  groups1 %>% #start from grouped data
    select(id) %>% #select id's
    slice(i) %>% #filter id's according to fold index
    left_join(df %>% #join by id 
               rowid_to_column()) %>%
    pull(rowid) 
})

Will it work for caret?

ctrl.10fold <- trainControl(method = "repeatedcv", number = 10, repeats = 3, index = folds)

rf.ctrl10 <- train(x = df[,-c(1:2)], y = df$y, data = df, method = "rf", tuneLength = 1,
                   ntree = 20, trControl = ctrl.10fold, importance = TRUE)

rf.ctrl10$results
#output
  mtry     RMSE    Rsquared       MAE     RMSESD  RsquaredSD      MAESD
1    3 1.041641 0.007534611 0.8246514 0.06953668 0.009488169 0.05934975

Also I suggest you check out library mlr, it has many nice features including blocking - here is one answer on SO. It has very nice tutorials on many things. For a long time I thought you either use caret or mlr but they complement each other very nicely.