Tags: r, cross-validation, r-caret, data-partitioning

Specifying a selected range of data for leave-one-out (jack-knife) cross-validation with the caret::train function


This question builds on one that I asked here: Creating data partitions over a selected range of data to be fed into the caret::train function for cross-validation.

The data I am working with looks like this:

df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5),
                 Time = rep(c(1:20, 1:20), each = 5),
                 Replicate = c(1:5))
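
This produces a 200-row data frame: data.frame recycles the shorter vectors, so Effect (length 100) and Replicate (length 5) are repeated to match Time (length 200).

dim(df)
# [1] 200   3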

Essentially, what I would like to do is create custom partitions like those generated by the caret::groupKFold function, but with the folds restricted to a specified range (i.e. Time > 15 days). Each fold should withhold one point as the test set, with all other data used for training, repeating until every point in the specified range has served as the test set. @Missuse wrote some code towards this end in the question linked above, which gets close to the desired output.

I would try to show you the desired output, but in all honesty the caret::groupKFold function's output confuses me, so hopefully the above description will suffice. Happy to try and clarify though!
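
For context, the structure caret::train's index argument expects (and what groupKFold returns) is simply a named list of integer vectors of training row numbers. A minimal illustration, assuming caret is loaded:

library(caret)
fold_example <- groupKFold(df$Time, k = 5) # one resample per group of Time values
str(fold_example) # a named list of integer vectors of training-row indices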


Solution

  • Here is one way you could create the desired partitions using the tidyverse:

    library(tidyverse)
    
    df %>%
      mutate(id = row_number()) %>% # create a column called id to hold the row numbers
      filter(Time > 15) %>% # subset the data frame to the specified range
      split(.$id) %>% # split the data frame into a list by id (row number)
      map(~ .x %>% select(id) %>% # reduce each element to a bare row number
            unlist %>%            # for use with trainControl
            unname) -> folds_cv
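
    Each element of folds_cv now holds a single row number from the Time > 15 subset, i.e. one candidate test point per fold. A quick inspection:

    str(head(folds_cv, 3)) # each element is a single integer row number
    length(folds_cv)       # 50 folds: rows 76-100 and 176-200 have Time > 15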
    

    EDIT: it seems the indexOut argument does not perform as expected, but the index argument does. So after making folds_cv, one can simply get the inverse (the training rows for each fold) using setdiff:

    folds_cv <- lapply(folds_cv, function(x) setdiff(1:nrow(df), x))
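
    As a quick sanity check, every element should now contain all rows except the single held-out one:

    stopifnot(all(lengths(folds_cv) == nrow(df) - 1)) # 199 training rows per fold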
    

    and now:

    library(caret)

    test_control <- trainControl(index = folds_cv,
                                 savePredictions = "final")
    
    
    quad.lm2 <- train(Time ~ Effect,
                      data = df,
                      method = "lm",
                      trControl = test_control)
    

    with a warning:

    Warning message:
    In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
      There were missing values in resampled performance measures.
    > quad.lm2
    Linear Regression 
    
    200 samples
      1 predictor
    
    No pre-processing
    Resampling: Bootstrapped (50 reps) 
    Summary of sample sizes: 199, 199, 199, 199, 199, 199, ... 
    Resampling results:
    
      RMSE          Rsquared  MAE         
      3.552714e-16  NaN       3.552714e-16
    
    Tuning parameter 'intercept' was held constant at a value of TRUE
    

    so each resample trained on 199 rows and predicted on the remaining 1, repeated for each of the 50 rows we wanted to hold out one at a time. This can be verified in:

    quad.lm2$pred
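
    For instance, the rows predicted on should be exactly the Time > 15 rows (rowIndex is a column caret adds to pred when savePredictions is set):

    all(sort(unique(quad.lm2$pred$rowIndex)) == which(df$Time > 15)) # should be TRUE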
    

    Why Rsquared is missing I am not sure; I will dig a bit deeper. (Most likely each resample contains only a single held-out observation, so there is no variance in the observed values and a per-resample Rsquared cannot be computed.)
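
    In the meantime, one workaround for a single overall error estimate is to pool all the held-out predictions and compute the metrics once with caret::postResample, which simply takes vectors of predictions and observations:

    postResample(pred = quad.lm2$pred$pred, obs = quad.lm2$pred$obs)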