r cross-validation r-caret data-partitioning

Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation

I want to create jack-knife data partitions for the data frame below, with the partitions to be used in caret::train (like the caret::groupKFold() produces). However, the catch is that I want to restrict the test points to say greater than 16 days, whilst using the remainder of these data as the training set.

df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
     Time = seq(1:20))

The reason I want to do this is that I am only really interested in how well the model is predicting the upper bound, as this is the region of interest. I feel like there is a way to do this with the caret::groupKFold() function but I am not sure how. Any help would be greatly appreciated.

An example of what each CV fold would comprise:

TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)

TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)

TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)

TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)

TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)

Albeit in the format that the caret::groupKFold function outputs, so that the folds could be fed into the caret::train function:

CVFolds <- caret::groupKFold(df$Time)
CVFolds

Thanks in advance!

Solution

For customized folds I find in built functions are usually not flexible enough. Therefore I usually produce them using tidyverse. One approach to your problem would be:

library(tidyverse)

df %>%
  mutate(id = row_number()) %>% #use the row number as a column called id
  filter(Time > 15) %>% #filter Time as per your need
  split(.$Time)  %>% #split df to a list by Time
  map(~ .x %>% select(id)) #select row numbers for each list element

example with two rows per each time:

df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
                 Time = rep(1:20, each = 2))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 15) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id)) -> test_folds

test_folds
#output
$`16`
  id
1 31
2 32

$`17`
  id
3 33
4 34

$`18`
  id
5 35
6 36

$`19`
  id
7 37
8 38

$`20`
   id
9  39
10 40

with unequal number of rows per time

df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(rep(3, 2))))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id))

$`2`
  id
1  6
2  7
3  8

$`3`
  id
4  9
5 10

Now you can define these hold out folds inside trainControl with the argument indexOut.

EDIT: to get similar output as caret::groupKFold one can:

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>%
        select(id) %>%
        unlist %>%
        unname) %>%
  unname