I am wondering how to recover the cross-validation predictions. I am interested in building a stacking model manually (like here in point 3.2.1) and I would need the model's predictions for each of the hold-out folds. I am attaching a short example.
# load the library
library(caret)
# load the cars dataset
data(cars)
# define folds
cv_folds <- createFolds(cars$Price, k = 5, list = TRUE)
# define training control
train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final')
# fix the parameters of the algorithm
# train the model
model <- caret::train(Price ~ ., data = cars, trControl = train_control, method = "gbm", verbose = FALSE)
# looking at predictions
model$pred
# verifying the number of observations
nrow(model$pred[model$pred$Resample == "Fold1",])
nrow(cars)
I would like to know which predictions come from fitting the model on folds 1-4 and evaluating on fold 5, and so on for the other folds. Looking at model$pred does not seem to give me what I need.
createFolds returns the held-out (test) indexes by default, because returnTrain defaults to FALSE, while the index argument of trainControl expects the rows used for training. So when you did:
cv_folds <- createFolds(cars$Price, k = 5, list = TRUE)
you received the hold-out folds
lengths(cv_folds)
#output
Fold1 Fold2 Fold3 Fold4 Fold5
161 160 161 160 162
each containing roughly 20% of your data.
You then passed these folds to trainControl as index:
train_control <- trainControl(method="cv", index = cv_folds, savePredictions = 'final')
From the help of trainControl:
index - a list with elements for each resampling iteration. Each list element is a vector of integers corresponding to the rows used for training at that iteration.
indexOut - a list (the same length as index) that dictates which data are held-out for each resample (as integers). If NULL, then the unique set of samples not contained in index is used.
So each time the model was built on about 160 rows (one fold) and validated on all the remaining rows. This is why
nrow(model$pred[model$pred$Resample == "Fold1",])
returns 643
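The mismatch described above can be checked without training anything. The following is a minimal sketch: the seed is an assumption added only for reproducibility, and train_rows and held_out are illustrative names, not caret objects. It mimics what trainControl does when indexOut is NULL, namely holding out every row not listed in index.

```r
library(caret)
data(cars)

set.seed(1)  # assumed seed, only so the folds are reproducible

# Default call: returnTrain = FALSE, so each element is a hold-out fold
folds <- createFolds(cars$Price, k = 5, list = TRUE)

# If these folds are passed as `index`, caret trains on them ...
train_rows <- folds$Fold1

# ... and, with indexOut = NULL, holds out every row not in `index`
held_out <- setdiff(seq_len(nrow(cars)), train_rows)

length(train_rows)  # about one fifth of the rows
length(held_out)    # the remaining rows, about four fifths
```

The two lengths should add up to nrow(cars), with the held-out part being the larger one, which is the opposite of what 5-fold CV is meant to do.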
What you should do is:
cv_folds <- createFolds(cars$Price, k = 5, list = TRUE, returnTrain = TRUE)
now:
lengths(cv_folds)
#output
Fold1 Fold2 Fold3 Fold4 Fold5
644 643 642 644 643
and after training the model:
nrow(model$pred[model$pred$Resample == "Fold1",])
#output
160
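For the stacking use case in the question, the out-of-fold predictions can then be lined up with the original rows through the rowIndex column of model$pred. This is a sketch under a few assumptions: the seed, the switch to method = "lm" with a two-predictor formula (chosen only to keep the example fast and free of extra packages), and the names oof and stacking_input are all illustrative and not part of the original example.

```r
library(caret)
data(cars)

set.seed(1)  # assumed seed, only for reproducibility

# Training-row indexes, as expected by trainControl(index = ...)
cv_folds <- createFolds(cars$Price, k = 5, list = TRUE, returnTrain = TRUE)

train_control <- trainControl(method = "cv", index = cv_folds,
                              savePredictions = "final")

# "lm" stands in for "gbm" here purely to keep the sketch lightweight
model <- train(Price ~ Mileage + Cylinder, data = cars,
               trControl = train_control, method = "lm")

# One hold-out prediction per row, sorted back into the original row order
oof <- model$pred[order(model$pred$rowIndex), ]

# Number of hold-out predictions contributed by each fold
table(oof$Resample)

# Hypothetical input for a second-level (stacking) model
stacking_input <- data.frame(lm_pred = oof$pred, Price = oof$obs)
```

Because every row is held out exactly once, oof should have as many rows as cars, and the Resample column tells you which fold produced each prediction.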