R xgboost xgb.cv pred values: best iteration or final iteration?

I am using the xgb.cv function to grid search best hyperparameters in the R implementation of xgboost. When setting predictions to TRUE, it supplies the predictions for the out of fold observations. Presuming you are using early stopping, do the predictions correspond to predictions at the best iteration or are they the predictions of the final iteration?

Solution

CV predictions correspond to the best iteration - you can see this using a 'strict' early_stopping value, then comparing the predictions with those made using models trained with the 'best' number of iterations and 'final' number of iterations, eg:

# Load minimum reproducible example
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
dtrain <- xgb.DMatrix(train$data, label=train$label)
test <- agaricus.test
dtest <- xgb.DMatrix(test$data, label=test$label)

# Perform cross validation with a 'strict' early_stopping
cv <- xgb.cv(data = train$data, label = train$label, nfold = 5, max_depth = 2,
             eta = 1, nthread = 4, nrounds = 10, objective = "binary:logistic",
             prediction = TRUE, early_stopping_rounds = 1)

# Check which round was the best iteration (the one that initiated the early stopping)
print(cv$best_iteration)
[1] 3

# Get the predictions
head(cv$pred)
[1] 0.84574515 0.15447612 0.15390711 0.84502697 0.09661318 0.15447612

# Train a model using 3 rounds (corresponds to best iteration)
trained_model <- xgb.train(data = dtrain, max_depth = 2,
              eta = 1, nthread = 4, nrounds = 3,
              watchlist = list(train = dtrain, eval = dtrain),
              objective = "binary:logistic")
# Get predictions
head(predict(trained_model, dtrain))
[1] 0.84625006 0.15353635 0.15353635 0.84625006 0.09530514 0.15353635

# Train a model using 10 rounds (corresponds to final iteration)
trained_model <- xgb.train(data = dtrain, max_depth = 2,
              eta = 1, nthread = 4, nrounds = 10,
              watchlist = list(train = dtrain, eval = dtrain),
              objective = "binary:logistic")
head(predict(trained_model, dtrain))
[1] 0.9884467125 0.0123147098 0.0050151693 0.9884467125 0.0008781737 0.0123147098

So the predictions from the CV are ~the same as the predictions made when the number of iterations is 'best', not 'final'.