
Understanding xgboost cross validation and AUC output results


I have the following XGBoost cross-validation (CV) model.

xgboostModelCV <- xgb.cv(data = dtrain,
                         nrounds = 20,
                         nfold = 3,
                         metrics = "auc",
                         verbose = TRUE,
                         eval_metric = "auc",
                         objective = "binary:logistic",
                         max_depth = 6,
                         eta = 0.01,
                         subsample = 0.5,
                         colsample_bytree = 1,
                         print_every_n = 1,
                         min_child_weight = 1,
                         booster = "gbtree",
                         early_stopping_rounds = 10,
                         watchlist = watchlist,
                         seed = 1234)

My question concerns the output and the nfold setting of the model; I set nfold to 3.

The output of the evaluation log looks as follows:

   iter train_auc_mean train_auc_std test_auc_mean test_auc_std
1     1      0.8852290  0.0023585703     0.8598630  0.005515424
2     2      0.9015413  0.0018569007     0.8792137  0.003765109
3     3      0.9081027  0.0014307577     0.8859040  0.005053600
4     4      0.9108463  0.0011838160     0.8883130  0.004324113
5     5      0.9130350  0.0008863908     0.8904100  0.004173123
6     6      0.9143187  0.0009514359     0.8910723  0.004372844
7     7      0.9151723  0.0010543653     0.8917300  0.003905284
8     8      0.9162787  0.0010344935     0.8929013  0.003582747
9     9      0.9173673  0.0010539116     0.8935753  0.003431949
10   10      0.9178743  0.0011498505     0.8942567  0.002955511
11   11      0.9182133  0.0010825702     0.8944377  0.003051411
12   12      0.9185767  0.0011846632     0.8946267  0.003026969
13   13      0.9186653  0.0013352629     0.8948340  0.002526793
14   14      0.9190500  0.0012537195     0.8954053  0.002636388
15   15      0.9192453  0.0010967155     0.8954127  0.002841402
16   16      0.9194953  0.0009818501     0.8956447  0.002783787
17   17      0.9198503  0.0009541517     0.8956400  0.002590862
18   18      0.9200363  0.0009890185     0.8957223  0.002580398
19   19      0.9201687  0.0010323405     0.8958790  0.002508695
20   20      0.9204030  0.0009725742     0.8960677  0.002581329

However, I set nrounds = 20 and nfold = 3, so shouldn't I have an output of 60 results rather than 20?

Or is the above output, as the column names suggest, the mean AUC score at each round?

So at round 1, the train_auc_mean of 0.8852290 would be the training AUC averaged over the 3 cross-validation folds?

And if I plot these AUC scores, I would be plotting the average AUC over the 3-fold cross-validation?
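
For context, this is how I would plot those averages (a minimal sketch using the xgboostModelCV object above):

plot(xgboostModelCV$evaluation_log$iter,
     xgboostModelCV$evaluation_log$test_auc_mean,
     type = "l", xlab = "Boosting round",
     ylab = "Mean test AUC over the 3 folds")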

Just want to make sure everything is clear.


Solution

  • You are correct: the output is the average of the per-fold AUC. However, if you wish to extract the individual fold AUCs for the best/last iteration, you can proceed as follows.

    An example using the Sonar data set from mlbench:

    library(xgboost)
    library(tidyverse)
    library(mlbench)
    library(pROC)  # provides roc() and auc(), used below
    
    data(Sonar)
    
    # Class is a factor; convert it to 0/1 labels for binary:logistic
    xgb.train.data <- xgb.DMatrix(as.matrix(Sonar[, 1:60]), label = as.numeric(Sonar$Class) - 1)
    param <- list(objective = "binary:logistic")
    

    In xgb.cv, set prediction = TRUE to keep the out-of-fold predictions:

    model.cv <- xgb.cv(param = param,
                       data = xgb.train.data,
                       nrounds = 50,
                       early_stopping_rounds = 10,
                       nfold = 3,
                       prediction = TRUE,
                       eval_metric = "auc")
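    
    With prediction = TRUE, model.cv$pred holds one out-of-fold prediction per row of the data, and model.cv$folds lists the row indices held out in each fold. A quick sketch to confirm this (base R only):

    length(model.cv$pred)  # one prediction per row of Sonar (208)
    str(model.cv$folds)    # list of 3 integer vectors of held-out row indices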
    

    Now go over the folds and connect the predictions with the true labels and the corresponding indexes:

    z <- lapply(model.cv$folds, function(x){
      pred <- model.cv$pred[x]                   # out-of-fold predictions for this fold
      true <- (as.numeric(Sonar$Class) - 1)[x]   # true 0/1 labels for the same rows
      index <- x                                 # held-out row indices
      data.frame(pred, true, index)
    })
    

    Give the folds names:

    names(z) <- paste("folds", 1:3, sep = "_")
    
    z %>%
      bind_rows(.id = "id") %>%
      group_by(id) %>%
      summarise(auroc = roc(true, pred) %>% auc())
    #output
    # A tibble: 3 x 2
      id      auroc
      <chr>   <dbl>
    1 folds_1 0.944
    2 folds_2 0.900
    3 folds_3 0.899
    

    The mean of these values is the same as the mean test AUC at the best iteration:

    z %>%
      bind_rows(.id = "id") %>%
      group_by(id) %>%
      summarise(auroc = roc(true, pred) %>% auc()) %>%
      pull(auroc) %>%
      mean()
    #output
    [1] 0.9143798
    
    model.cv$evaluation_log[model.cv$best_iteration,]
    #output
       iter train_auc_mean train_auc_std test_auc_mean test_auc_std
    1:   48              1             0       0.91438   0.02092817
    

    You can of course do much more, such as plotting the ROC curve for each fold; a minimal sketch follows.
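    
    For example, here is a sketch of overlaid per-fold ROC curves with pROC, reusing the z list built above:

    cols <- c("black", "red", "blue")
    for (i in seq_along(z)) {
      r <- roc(z[[i]]$true, z[[i]]$pred)       # ROC object for this fold
      plot(r, add = (i > 1), col = cols[i])    # overlay all folds on one plot
    }
    legend("bottomright", legend = names(z), col = cols, lty = 1)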