r, cross-validation, r-caret, ensemble-learning

Retrieving predictions for hold-out folds in caret


I am wondering how to recover the cross-validation predictions. I am interested in building a stacking model manually (like here in point 3.2.1) and I would need the model's predictions for each of the hold-out folds. I am attaching a short example.

# load the library
library(caret)
# load the cars dataset (Kelley Blue Book resale data, 804 rows)
data(cars)
# define folds
cv_folds <- createFolds(cars$Price, k = 5, list = TRUE)
# define training control
train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final')
# fix the parameters of the algorithm
# train the model
model <- caret::train(Price ~ ., data = cars, trControl = train_control, method = "gbm", verbose = FALSE)
# looking at predictions
model$pred

# verifying the number of observations
nrow(model$pred[model$pred$Resample == "Fold1",])
nrow(cars)

I would like to know what are the predictions that come from estimating the model on folds 1-4 and evaluating on fold 5 etc. Looking at model$pred does not seem to give me what I need.


Solution

  • When you pass folds to trainControl via its index argument, caret treats them as the training sets. By default, however, createFolds returns the hold-out (test) indices for each fold. So when you did:

    cv_folds <- createFolds(cars$Price, k = 5, list = TRUE)
    

    you received the hold-out (test) folds, which caret then used as training sets

    lengths(cv_folds)
    #output
    Fold1 Fold2 Fold3 Fold4 Fold5 
      161   160   161   160   162
    

    each containing 20% of your data

    then you specified these folds in trainControl:

    train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final')
    

    from the help of trainControl:

    index - a list with elements for each resampling iteration. Each list element is a vector of integers corresponding to the rows used for training at that iteration.

    indexOut - a list (the same length as index) that dictates which data are held-out for each resample (as integers). If NULL, then the unique set of samples not contained in index is used.
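    To make the quoted behaviour concrete, here is a sketch (using the same cars data; passing indexOut explicitly is an assumption about an equivalent setup, not what the question's code does) in which the 20% folds are made the hold-out sets:

    ```r
    library(caret)
    data(cars)

    # createFolds with the default returnTrain = FALSE gives the 20% test folds
    cv_folds <- createFolds(cars$Price, k = 5, list = TRUE)

    # derive the complementary 80% training indices for each fold
    train_idx <- lapply(cv_folds, function(i) setdiff(seq_len(nrow(cars)), i))

    # index = rows used for fitting, indexOut = rows held out and predicted
    train_control <- trainControl(method = "cv",
                                  index = train_idx,
                                  indexOut = cv_folds,
                                  savePredictions = "final")
    ```

    With indexOut left as NULL, caret would hold out exactly the rows missing from index, so passing train_idx alone behaves the same; writing both simply makes the pairing explicit.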

    So each time the model was fitted on a single ~160-row fold and validated on the remaining ~643 rows. This is why

    nrow(model$pred[model$pred$Resample == "Fold1",])
    

    returns 643 (the 804 rows of cars minus the 161 rows of Fold1).

    What you should do instead is request the training indices with returnTrain = TRUE:

    cv_folds <- createFolds(cars$Price, k = 5, list = TRUE, returnTrain = TRUE)
    

    now:

    lengths(cv_folds)
    #output
    Fold1 Fold2 Fold3 Fold4 Fold5 
      644   643   642   644   643 
    

    and after training the model:

    nrow(model$pred[model$pred$Resample == "Fold1",])
    #output
    160
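
    With the corrected folds, the hold-out predictions needed for manual stacking can be collected from model$pred. A sketch (assuming a model trained with savePredictions = 'final' as above, so model$pred holds exactly one out-of-fold prediction per row of cars):

    ```r
    # each row of cars is held out exactly once, so model$pred has nrow(cars) rows;
    # rowIndex maps each prediction back to its row in the original data
    oof <- model$pred[order(model$pred$rowIndex), ]

    # out-of-fold predictions aligned with cars: use these as the
    # level-one feature when fitting the stacking model
    stack_feature <- oof$pred
    stopifnot(length(stack_feature) == nrow(cars))
    ```

    Sorting by rowIndex is what aligns the predictions with the original observations; model$pred itself is ordered by resample, not by row.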