Search code examples
rcross-validationr-caretdata-partitioning

Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation


I want to create jack-knife data partitions for the data frame below, with the partitions to be used in caret::train (like the caret::groupKFold() produces). However, the catch is that I want to restrict the test points to say greater than 16 days, whilst using the remainder of these data as the training set.

df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
     Time = seq(1:20))

The reason I want to do this is that I am only really interested in how well the model is predicting the upper bound, as this is the region of interest. I feel like there is a way to do this with the caret::groupKFold() function but I am not sure how. Any help would be greatly appreciated.

An example of what each CV fold would comprise:

TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)

TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)

TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)

TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)

TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)

Albeit in the format that the caret::groupKFold function outputs, so that the folds could be fed into the caret::train function:

CVFolds <- caret::groupKFold(df$Time)
CVFolds

Thanks in advance!


Solution

  • For customized folds I find in built functions are usually not flexible enough. Therefore I usually produce them using tidyverse. One approach to your problem would be:

    library(tidyverse)
    
    df %>%
      mutate(id = row_number()) %>% #use the row number as a column called id
      filter(Time > 15) %>% #filter Time as per your need
      split(.$Time)  %>% #split df to a list by Time
      map(~ .x %>% select(id)) #select row numbers for each list element
    

    example with two rows per each time:

    df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
                     Time = rep(1:20, each = 2))
    
    df %>%
      mutate(id = row_number()) %>%
      filter(Time > 15) %>%
      split(.$Time)  %>%
      map(~ .x %>% select(id)) -> test_folds
    
    test_folds
    #output
    $`16`
      id
    1 31
    2 32
    
    $`17`
      id
    3 33
    4 34
    
    $`18`
      id
    5 35
    6 36
    
    $`19`
      id
    7 37
    8 38
    
    $`20`
       id
    9  39
    10 40
    

    with unequal number of rows per time

    df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                     Time = c(rep(1, 5), rep(2, 3), rep(rep(3, 2))))
    
    df %>%
      mutate(id = row_number()) %>%
      filter(Time > 1) %>%
      split(.$Time)  %>%
      map(~ .x %>% select(id))
    
    $`2`
      id
    1  6
    2  7
    3  8
    
    $`3`
      id
    4  9
    5 10
    

    Now you can define these hold out folds inside trainControl with the argument indexOut.

    EDIT: to get similar output as caret::groupKFold one can:

    df %>%
      mutate(id = row_number()) %>%
      filter(Time > 1) %>%
      split(.$Time)  %>%
      map(~ .x %>%
            select(id) %>%
            unlist %>%
            unname) %>%
      unname