I have the following XGBoost cross-validation (CV) model:
xgboostModelCV <- xgb.cv(data = dtrain,
                         nrounds = 20,
                         nfold = 3,
                         metrics = "auc",
                         verbose = TRUE,
                         "eval_metric" = "auc",
                         "objective" = "binary:logistic",
                         "max.depth" = 6,
                         "eta" = 0.01,
                         "subsample" = 0.5,
                         "colsample_bytree" = 1,
                         print_every_n = 1,
                         "min_child_weight" = 1,
                         booster = "gbtree",
                         early_stopping_rounds = 10,
                         watchlist = watchlist,
                         seed = 1234)
My question is about the output and the nfold setting of the model. I set nfold to 3.
The output of the evaluation log looks as follows:
iter train_auc_mean train_auc_std test_auc_mean test_auc_std
1 1 0.8852290 0.0023585703 0.8598630 0.005515424
2 2 0.9015413 0.0018569007 0.8792137 0.003765109
3 3 0.9081027 0.0014307577 0.8859040 0.005053600
4 4 0.9108463 0.0011838160 0.8883130 0.004324113
5 5 0.9130350 0.0008863908 0.8904100 0.004173123
6 6 0.9143187 0.0009514359 0.8910723 0.004372844
7 7 0.9151723 0.0010543653 0.8917300 0.003905284
8 8 0.9162787 0.0010344935 0.8929013 0.003582747
9 9 0.9173673 0.0010539116 0.8935753 0.003431949
10 10 0.9178743 0.0011498505 0.8942567 0.002955511
11 11 0.9182133 0.0010825702 0.8944377 0.003051411
12 12 0.9185767 0.0011846632 0.8946267 0.003026969
13 13 0.9186653 0.0013352629 0.8948340 0.002526793
14 14 0.9190500 0.0012537195 0.8954053 0.002636388
15 15 0.9192453 0.0010967155 0.8954127 0.002841402
16 16 0.9194953 0.0009818501 0.8956447 0.002783787
17 17 0.9198503 0.0009541517 0.8956400 0.002590862
18 18 0.9200363 0.0009890185 0.8957223 0.002580398
19 19 0.9201687 0.0010323405 0.8958790 0.002508695
20 20 0.9204030 0.0009725742 0.8960677 0.002581329
However, I set nrounds = 20 and nfold = 3, so should I have an output of 60 results and not 20?
Or is the above output, as the column names suggest, the mean AUC score at each round?
So at round 1, the train_auc_mean for the training set is 0.8852290, which would be the average over the 3 cross-validation folds?
So if I plot these AUC scores, I would be plotting the average AUC score over the 3-fold cross-validation?
I just want to make sure everything is clear.
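You are correct that the output is the average of the per-fold AUC: each row is one boosting round, and the mean and std columns summarise the 3 folds at that round, so 20 rounds give 20 rows. However, if you wish to extract the individual fold AUC for the best/last iteration, you can proceed as follows.
An example using the Sonar data set from mlbench: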
library(xgboost)
library(tidyverse)
library(mlbench)
library(pROC)  # provides roc() and auc(), used below
data(Sonar)
xgb.train.data <- xgb.DMatrix(as.matrix(Sonar[,1:60]), label = as.numeric(Sonar$Class)-1)
param <- list(objective = "binary:logistic")
In xgb.cv, set prediction = TRUE:
model.cv <- xgb.cv(params = param,
                   data = xgb.train.data,
                   nrounds = 50,
                   early_stopping_rounds = 10,
                   nfold = 3,
                   prediction = TRUE,
                   eval_metric = "auc")
Now go over the folds and connect the out-of-fold predictions with the true labels and the corresponding indexes:
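z <- lapply(model.cv$folds, function(x){
  pred <- model.cv$pred[x]                   # predictions for the rows held out in this fold
  true <- (as.numeric(Sonar$Class)-1)[x]     # matching 0/1 labels
  index <- x                                 # row indexes in the original data
  out <- data.frame(pred, true, index)
  out
})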
Give the folds names and compute the AUC for each fold:
names(z) <- paste("folds", 1:3, sep = "_")
z %>%
  bind_rows(.id = "id") %>%
  group_by(id) %>%
  summarise(auroc = roc(true, pred) %>%
              auc())
#output
# A tibble: 3 x 2
id auroc
<chr> <dbl>
1 folds_1 0.944
2 folds_2 0.900
3 folds_3 0.899
The mean of these values is the same as the mean AUC at the best iteration:
z %>%
  bind_rows(.id = "id") %>%
  group_by(id) %>%
  summarise(auroc = roc(true, pred) %>%
              auc()) %>%
  pull(auroc) %>%
  mean()
#output
[1] 0.9143798
model.cv$evaluation_log[model.cv$best_iteration,]
#output
iter train_auc_mean train_auc_std test_auc_mean test_auc_std
1: 48 1 0 0.91438 0.02092817
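To make the link between the per-fold values and the evaluation log explicit, here is a minimal sketch in base R that recomputes the summary from the three rounded fold AUCs printed above. The variable names are illustrative only. It assumes that test_auc_std is a population standard deviation (dividing by the number of folds rather than by one less); that is an inference from the printed numbers, not something stated in the output.

```r
# Per-fold test AUCs as printed above (rounded to three decimals).
fold_auc <- c(folds_1 = 0.944, folds_2 = 0.900, folds_3 = 0.899)

# Mean across folds: this corresponds to test_auc_mean.
fold_mean <- mean(fold_auc)

# Population standard deviation (divide by n, not n - 1).
# Assumption: this is the convention behind test_auc_std; it is much
# closer to the printed value than sd(), which divides by n - 1.
fold_sd_pop <- sqrt(mean((fold_auc - fold_mean)^2))

# Sample standard deviation, shown only for comparison.
fold_sd_sample <- sd(fold_auc)

print(round(c(mean = fold_mean,
              sd_population = fold_sd_pop,
              sd_sample = fold_sd_sample), 4))
```

With these rounded inputs the mean comes out at about 0.914 and the population standard deviation at about 0.021, in line with the test_auc_mean and test_auc_std columns; small differences are due to the rounding of the fold values.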
You can of course do much more, such as plotting ROC curves for each fold.
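As one possible illustration of per-fold ROC curves, the sketch below overlays one curve per fold using pROC. To keep it runnable on its own, it builds a hypothetical stand-in called z_demo from simulated labels and scores; with the code above you would pass the real list z instead. The names z_demo, rocs and fold_auc_demo are made up for this example.

```r
library(pROC)

set.seed(1)

# Hypothetical stand-in for `z`: three folds, each with 0/1 labels and
# simulated predicted probabilities that are higher for class 1.
z_demo <- lapply(1:3, function(i) {
  true <- rbinom(60, 1, 0.5)
  pred <- plogis(2 * true - 1 + rnorm(60))
  data.frame(pred, true)
})
names(z_demo) <- paste("folds", 1:3, sep = "_")

# One ROC object per fold (response first, predictor second).
rocs <- lapply(z_demo, function(d) roc(d$true, d$pred, quiet = TRUE))

# Draw the first curve, then overlay the remaining folds.
plot(rocs[[1]], col = 1)
for (i in 2:3) lines(rocs[[i]], col = i)
legend("bottomright", legend = names(rocs), col = 1:3, lwd = 2)

# AUC per fold, as plain numbers.
fold_auc_demo <- sapply(rocs, function(r) as.numeric(auc(r)))
print(fold_auc_demo)
```

Because the data here are simulated, the curves and AUC values say nothing about the Sonar model; only the plotting pattern carries over.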