r, cross-validation, r-caret, ensemble-learning

Retrieving predictions for hold-out folds in caret


I am wondering how to recover the cross-validation predictions. I am interested in building a stacking model manually (like here in point 3.2.1) and I would need the model's predictions for each of the hold-out folds. I am attaching a short example.

# load the library
library(caret)
# load the cars dataset (Kelley Blue Book resale data, 804 rows)
data(cars)
# define folds
cv_folds <- createFolds(cars$Price, k = 5, list = TRUE)
# define training control
train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final')
# fix the parameters of the algorithm
# train the model
model <- caret::train(Price ~ ., data = cars, trControl = train_control, method = "gbm", verbose = FALSE)
# looking at predictions
model$pred

# verifying the number of observations
nrow(model$pred[model$pred$Resample == "Fold1",])
nrow(cars)

I would like to know what are the predictions that come from estimating the model on folds 1-4 and evaluating on fold 5 etc. Looking at model$pred does not seem to give me what I need.


Solution

  • When you pass folds to trainControl via its index argument, caret treats them as the training sets. By default, however, createFolds returns the hold-out (test) indices for each fold. So when you did:

    cv_folds <- createFolds(cars$Price, k = 5, list = TRUE)
    

    you received the hold-out (test) folds, which caret then used as training sets

    lengths(cv_folds)
    #output
    Fold1 Fold2 Fold3 Fold4 Fold5 
      161   160   161   160   162
    

    each containing 20% of your data

    then you specified these folds in trainControl:

    train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final')
    

    from the help of trainControl:

    index - a list with elements for each resampling iteration. Each list element is a vector of integers corresponding to the rows used for training at that iteration.

    indexOut - a list (the same length as index) that dictates which data are held-out for each resample (as integers). If NULL, then the unique set of samples not contained in index is used.
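    To make the quoted behaviour concrete, here is a sketch (using the same cars data; passing indexOut explicitly is an assumption about an equivalent setup, not what the question's code does) in which the 20% folds are made the hold-out sets:

    ```r
    library(caret)
    data(cars)

    # createFolds with the default returnTrain = FALSE gives the 20% test folds
    cv_folds <- createFolds(cars$Price, k = 5, list = TRUE)

    # derive the complementary 80% training indices for each fold
    train_idx <- lapply(cv_folds, function(i) setdiff(seq_len(nrow(cars)), i))

    # index = rows used for fitting, indexOut = rows held out and predicted
    train_control <- trainControl(method = "cv",
                                  index = train_idx,
                                  indexOut = cv_folds,
                                  savePredictions = "final")
    ```

    With indexOut left as NULL, caret would hold out exactly the rows missing from index, so passing train_idx alone behaves the same; writing both simply makes the pairing explicit.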

    So each time the model was fitted on a single ~160-row fold and validated on the remaining ~643 rows. This is why

    nrow(model$pred[model$pred$Resample == "Fold1",])
    

    returns 643 (the 804 rows of cars minus the 161 rows of Fold1).

    What you should do instead is request the training indices with returnTrain = TRUE:

    cv_folds <- createFolds(cars$Price, k = 5, list = TRUE, returnTrain = TRUE)
    

    now:

    lengths(cv_folds)
    #output
    Fold1 Fold2 Fold3 Fold4 Fold5 
      644   643   642   644   643 
    

    and after training the model:

    nrow(model$pred[model$pred$Resample == "Fold1",])
    #output
    160
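
    With the corrected folds, the hold-out predictions needed for manual stacking can be collected from model$pred. A sketch (assuming a model trained with savePredictions = 'final' as above, so model$pred holds exactly one out-of-fold prediction per row of cars):

    ```r
    # each row of cars is held out exactly once, so model$pred has nrow(cars) rows;
    # rowIndex maps each prediction back to its row in the original data
    oof <- model$pred[order(model$pred$rowIndex), ]

    # out-of-fold predictions aligned with cars: use these as the
    # level-one feature when fitting the stacking model
    stack_feature <- oof$pred
    stopifnot(length(stack_feature) == nrow(cars))
    ```

    Sorting by rowIndex is what aligns the predictions with the original observations; model$pred itself is ordered by resample, not by row.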