I am trying to use R's xgboost package, but there is something I find confusing. In the xgboost manual, under the xgb.cv function, it says:
The original sample is randomly partitioned into nfold equal size subsamples.
Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data.
The cross-validation process is then repeated nrounds times, with each of the nfold subsamples used exactly once as the validation data.
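To make sure I read that correctly, here is a minimal sketch of just the partitioning step (using sample() to assign fold ids; this is my own illustration, not xgboost's internal code):

set.seed(1)
nfold <- 5
n <- 100                                     # e.g. 100 rows of data
fold_id <- sample(rep(1:nfold, length.out = n))
# rows with fold_id == k form the validation subsample for fold k;
# the remaining nfold - 1 subsamples form the training data
table(fold_id)                               # roughly equal-sized subsamples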
And this is the code in the manual:
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5,
             metrics = list("rmse", "auc"),
             max_depth = 3, eta = 1, objective = "binary:logistic")
print(cv)
print(cv, verbose = TRUE)
And the result is:
##### xgb.cv 5-folds
call:
xgb.cv(data = dtrain, nrounds = 3, nfold = 5, metrics = list("rmse",
"auc"), nthread = 2, max_depth = 3, eta = 1, objective = "binary:logistic")
params (as set within xgb.cv):
nthread = "2", max_depth = "3", eta = "1", objective = "binary:logistic",
eval_metric = "rmse", eval_metric = "auc", silent = "1"
callbacks:
cb.print.evaluation(period = print_every_n, showsd = showsd)
cb.evaluation.log()
niter: 3
evaluation_log:
iter train_rmse_mean train_rmse_std train_auc_mean train_auc_std test_rmse_mean test_rmse_std test_auc_mean test_auc_std
1 0.1623756 0.002693092 0.9871108 1.123550e-03 0.1625222 0.009134128 0.9870954 0.0045008818
2 0.0784902 0.002413883 0.9998370 1.317346e-04 0.0791366 0.004566554 0.9997756 0.0003538184
3 0.0464588 0.005172930 0.9998942 7.315846e-05 0.0478028 0.007763252 0.9998902 0.0001347032
Let's say nfold = 5 and nrounds = 2. That means the data is split into 5 equal-sized parts, and the algorithm will generate 2 trees.
My understanding is: each subsample has to serve as the validation data once. When one subsample is the validation data, 2 trees will be generated (trained on the other 4 subsamples). So we will have 5 sets of trees (each set has 2 trees, because nrounds = 2). Then we check whether the evaluation metric varies a lot or not.
But the result does not read that way. Each nrounds value has one line of evaluation metrics, which looks like it already includes the 'cross validation' part. So, if 'the cross-validation process is then repeated nrounds times', how can 'each of the nfold subsamples [be] used exactly once as the validation data'?
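To be concrete, here is a rough sketch of what I expected xgb.cv to be doing internally, written with xgb.train and a manual fold assignment (fold_id, dtrain_k, dvalid_k and the sample()-based split are my own names for illustration; this is not xgboost's actual implementation):

library(xgboost)
data(agaricus.train, package = 'xgboost')
X <- agaricus.train$data
y <- agaricus.train$label

set.seed(1)
fold_id <- sample(rep(1:5, length.out = nrow(X)))   # nfold = 5

models <- list()
for (k in 1:5) {
  # fold k is held out as validation data, the other 4 folds are training data
  dtrain_k <- xgb.DMatrix(X[fold_id != k, ], label = y[fold_id != k])
  dvalid_k <- xgb.DMatrix(X[fold_id == k, ], label = y[fold_id == k])
  # one model of 2 trees per held-out fold, since nrounds = 2
  models[[k]] <- xgb.train(params = list(max_depth = 3, eta = 1,
                                         objective = "binary:logistic",
                                         eval_metric = "auc"),
                           data = dtrain_k, nrounds = 2,
                           watchlist = list(test = dvalid_k),
                           verbose = 0)
}

Under that reading I would end up with 5 separate sets of 2 trees, and the spread of the 5 validation scores would tell me whether the metric varies a lot.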
Those are the means and standard deviations of the scores of the nfold fit-test[1] procedures run at every round in nrounds. The XGBoost cross-validation process proceeds like this:
[1] Note that what I would call the 'validation' set is identified by XGBoost as the 'test' set in the evaluation log.
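You can see that structure directly in the object you printed: the evaluation_log component has one row per boosting round (niter = 3 here), not one row per fold, and each *_mean / *_std pair is computed over the nfold = 5 fold scores at that round. A quick way to inspect it (column names taken from your output):

cv$evaluation_log                  # one row per round, 3 rows in total
cv$evaluation_log$test_auc_mean    # mean validation ('test') AUC over the 5 folds, rounds 1..3
cv$evaluation_log$test_auc_std     # standard deviation of the same 5 fold scores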