Search code examples
rmachine-learningcross-validationr-caret

caret: combine the stratified createMultiFolds (repeatedCV) and groupKFold


My question is very similar to the one asked in caret: combine createResample and groupKFold

The only difference: I need to create stratified folds (also repeated 10 times) after grouping instead of bootstrapped resamples (which are not stratified as far as I know) for using it with caret's trainControl. The following code is working with 10-fold repeated CV but I couldn't include the grouping of the data based on an "ID" (df$ID).

# creating indices
cv.10.folds <- createMultiFolds(rf_label, k = 10, times = 10)
# creating folds    
ctrl.10fold <- trainControl(method = "repeatedcv", number = 10, repeats = 10, index = cv.10.folds)
# train
rf.ctrl10 <- train(rf_train, y = rf_label, method = "rf", tuneLength = 6,
                       ntree = 1000, trControl = ctrl.10fold, importance = TRUE)

That's my actual problem: My data contains many groups composed of 20 instances each, having the same "ID". So, when using the 10-fold CV repeated 10 times I get some instances of a group in the training and some in the validation set. This I want to avoid, but overall I need a stratified partitioning for the prediction value (df$Label). (All instances having the same "ID" also have the same prediction/label value.)

In the provided and accepted answer from the link above (see parts below) I guess I have to modify the folds2 line to contain the stratified 10-fold CV instead of the bootstrapped

folds <- groupKFold(x)
folds2 <- lapply(folds, function(x) lapply(1:10, function(i) sample(x, size = length(x), replace = TRUE)))

but unfortunately I cannot figure out how exactly. Could you help me with that?


Solution

  • Here is an approach to perform stratified repeated K-fold CV with blocking.

    library(caret)
    library(tidyverse)
    

    some fake data where id will be the blocking factor:

    id <- sample(1:55, size = 1000, replace = T)
    y <- rnorm(1000)
    x <- matrix(rnorm(10000), ncol = 10)
    df <- data.frame(id, y, x)
    

    summarise the observations by the blocking factor:

    df %>%
      group_by(id) %>%
      summarise(mean = mean(y)) %>%
      ungroup() -> groups1 
    

    create the stratified folds based on the grouped data:

    folds <- createMultiFolds(groups1$mean, 10, 3)
    

    back join the original df to the group data and take the df row id's

    folds <- lapply(folds, function(i){
      data.frame(id = i) %>%
        left_join(df %>%
                    rowid_to_column()) %>%
        pull(rowid) 
    })
    

    check if the data id's in the test are not in the train:

    lapply(folds, function(i){
      sum(df[i,1] %in% df[-i,1])
    })
    

    output is a bunch of zeros, meaning no id's in the test folds are in the train folds.

    If your group id's are not numeric there are two approaches to make this work:
    1 convert them to numeric:

    first some data

    id <- sample(1:55, size = 1000, replace = T)
    y <- rnorm(1000)
    x <- matrix(rnorm(10000), ncol = 10)
    df <- data.frame(id = paste0("id_", id), y, x) #factor id's
    
    df %>%
      mutate(id = as.numeric(id)) %>% #convert to numeric
      group_by(id) %>%
      summarise(mean = mean(y)) %>%
      ungroup() -> groups1 
    
    folds <- createMultiFolds(groups1$mean, 10, 3)
    
    folds <- lapply(folds, function(i){
      data.frame(id = i) %>%
        left_join(df %>%
                    mutate(id = as.numeric(id)) %>% #also need to convert to numeric in the original data frame
                    rowid_to_column()) %>%
        pull(rowid) 
    })  
    

    2 filter the id's in grouped data according to fold indexes and then join by id's

    df %>%
      group_by(id) %>%
      summarise(mean = mean(y)) %>%
      ungroup() -> groups1 
    
    folds <- createMultiFolds(groups1$mean, 10, 3)
    
    folds <- lapply(folds, function(i){
      groups1 %>% #start from grouped data
        select(id) %>% #select id's
        slice(i) %>% #filter id's according to fold index
        left_join(df %>% #join by id 
                   rowid_to_column()) %>%
        pull(rowid) 
    })
    

    Will it work for caret?

    ctrl.10fold <- trainControl(method = "repeatedcv", number = 10, repeats = 3, index = folds)
    
    rf.ctrl10 <- train(x = df[,-c(1:2)], y = df$y, data = df, method = "rf", tuneLength = 1,
                       ntree = 20, trControl = ctrl.10fold, importance = TRUE)
    
    rf.ctrl10$results
    #output
      mtry     RMSE    Rsquared       MAE     RMSESD  RsquaredSD      MAESD
    1    3 1.041641 0.007534611 0.8246514 0.06953668 0.009488169 0.05934975
    

    Also I suggest you check out library mlr, it has many nice features including blocking - here is one answer on SO. It has very nice tutorials on many things. For a long time I thought you either use caret or mlr but they complement each other very nicely.