Tags: r, cross-validation, r-caret, data-partitioning

Specifying a selected range of data for leave-one-out (jack-knife) cross-validation with the caret::train function


This question builds on one that I asked here: Creating data partitions over a selected range of data to be fed into the caret::train function for cross-validation.

The data I am working with looks like this:

df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5),
                 Time = rep(c(1:20, 1:20), each = 5),
                 Replicate = c(1:5))
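
This produces a 200-row data frame: data.frame recycles the shorter vectors, so Effect (length 100) and Replicate (length 5) are repeated to match Time (length 200).

dim(df)
# [1] 200   3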

Essentially, what I would like to do is create custom partitions like those generated by the caret::groupKFold function, but with the folds restricted to a specified range (i.e. Time > 15 days). Each fold should withhold one point as the test set, with all other data used for training, repeating until every point in the specified range has served as the test set. @Missuse wrote some code towards this end in the question linked above, which gets close to the desired output.

I would try to show you the desired output, but in all honesty the caret::groupKFold function's output confuses me, so hopefully the above description will suffice. Happy to try and clarify though!
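
For context, the structure caret::train's index argument expects (and what groupKFold returns) is simply a named list of integer vectors of training row numbers. A minimal illustration, assuming caret is loaded:

library(caret)
fold_example <- groupKFold(df$Time, k = 5) # one resample per group of Time values
str(fold_example) # a named list of integer vectors of training-row indices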


Solution

  • Here is one way you could create the desired partitions using the tidyverse:

    library(tidyverse)
    
    df %>%
      mutate(id = row_number()) %>% # create a column called id to hold the row numbers
      filter(Time > 15) %>% # subset the data frame to the specified range
      split(.$id) %>% # split the data frame into a list by id (row number)
      map(~ .x %>% select(id) %>% # reduce each element to a bare row number
            unlist %>%            # for use with trainControl
            unname) -> folds_cv
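
    Each element of folds_cv now holds a single row number from the Time > 15 subset, i.e. one candidate test point per fold. A quick inspection:

    str(head(folds_cv, 3)) # each element is a single integer row number
    length(folds_cv)       # 50 folds: rows 76-100 and 176-200 have Time > 15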
    

    EDIT: it seems the indexOut argument does not perform as expected, but the index argument does. So after making folds_cv, one can simply get the inverse (the training rows for each fold) using setdiff:

    folds_cv <- lapply(folds_cv, function(x) setdiff(1:nrow(df), x))
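
    As a quick sanity check, every element should now contain all rows except the single held-out one:

    stopifnot(all(lengths(folds_cv) == nrow(df) - 1)) # 199 training rows per fold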
    

    and now:

    library(caret)

    test_control <- trainControl(index = folds_cv,
                                 savePredictions = "final")
    
    
    quad.lm2 <- train(Time ~ Effect,
                      data = df,
                      method = "lm",
                      trControl = test_control)
    

    with a warning:

    Warning message:
    In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
      There were missing values in resampled performance measures.
    > quad.lm2
    Linear Regression 
    
    200 samples
      1 predictor
    
    No pre-processing
    Resampling: Bootstrapped (50 reps) 
    Summary of sample sizes: 199, 199, 199, 199, 199, 199, ... 
    Resampling results:
    
      RMSE          Rsquared  MAE         
      3.552714e-16  NaN       3.552714e-16
    
    Tuning parameter 'intercept' was held constant at a value of TRUE
    

    so each resample trained on 199 rows and predicted on the remaining 1, repeated for each of the 50 rows we wanted to hold out one at a time. This can be verified in:

    quad.lm2$pred
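
    For instance, the rows predicted on should be exactly the Time > 15 rows (rowIndex is a column caret adds to pred when savePredictions is set):

    all(sort(unique(quad.lm2$pred$rowIndex)) == which(df$Time > 15)) # should be TRUE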
    

    Why Rsquared is missing I am not sure; I will dig a bit deeper. (Most likely each resample contains only a single held-out observation, so there is no variance in the observed values and a per-resample Rsquared cannot be computed.)
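
    In the meantime, one workaround for a single overall error estimate is to pool all the held-out predictions and compute the metrics once with caret::postResample, which simply takes vectors of predictions and observations:

    postResample(pred = quad.lm2$pred$pred, obs = quad.lm2$pred$obs)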