
Training and predicting with XGBoost in R


I have a question related to the cross-validation, tuning, training, and prediction steps of a model when using the xgboost package and the function xgb.cv in R.

In particular, I have reused and adapted code from the internet to search the parameter space for the best parameters (tuning) with xgb.cv on a classification problem.

Here is the code used to perform this task:

# *****************************
# *******  TUNING  ************
# *****************************
library(xgboost)  # dtrain below is assumed to be an xgb.DMatrix of the training data

start_time <- Sys.time()

best_param <- list()
best_seednumber <- 1234
best_acc <- 0
best_acc_index <- 0

set.seed(1234)
# In reality, might need 100 or 200 iters
for (iter in 1:200) {
  param <- list(objective = "binary:logistic",
                eval_metric = c("error"),      # rmse is used for regression
                max_depth = sample(6:10, 1),
                eta = runif(1, .01, .1),   # Learning rate, default: 0.3
                subsample = runif(1, .6, .9),
                colsample_bytree = runif(1, .5, .8), 
                min_child_weight = sample(5:10, 1), # These two are important
                max_delta_step = sample(5:10, 1) # Can help to focus error
                # into a small range.
  )
  cv.nround <-  1000
  cv.nfold <-  10 # 10-fold cross-validation
  seed.number  <-  sample.int(10000, 1) # set seed for the cv
  set.seed(seed.number)
  mdcv <- xgb.cv(data = dtrain, params = param,  
                 nfold = cv.nfold, nrounds = cv.nround,
                 verbose = FALSE, early_stopping_rounds = 20, maximize = FALSE,
                 stratified = TRUE)

  max_acc_index  <-  mdcv$best_iteration
  max_acc <- 1 - mdcv$evaluation_log[mdcv$best_iteration]$test_error_mean
  print(iter)
  print(max_acc)
  print(mdcv$evaluation_log[mdcv$best_iteration])

  if (max_acc > best_acc) {
    best_acc <- max_acc
    best_acc_index <- max_acc_index
    best_seednumber <- seed.number
    best_param <- param
  }
}

end_time <- Sys.time()

print(end_time - start_time)    # Duration -> 1.54796 hours

After about 1.5 hours this code gives me back the best-performing parameters found in the cross-validation setting. I am also able to reproduce the accuracy obtained in the loop and the best parameters:

# Reproduce what found in loop
set.seed(best_seednumber)
best_model_cv <- xgb.cv(data = dtrain, params = best_param,
                        nfold = cv.nfold, nrounds = cv.nround,
                        verbose = TRUE, early_stopping_rounds = 20,
                        maximize = FALSE, stratified = TRUE,
                        prediction = TRUE)
print(best_model_cv)
best_model_cv$params

Now I want to use these "best parameters" to train a model on my full training set using either xgboost or xgb.train, and then make predictions on a test data set.

best_model <- xgboost(params = best_param, data = dtrain,
                      seed = best_seednumber, nrounds = 10)

At this point, I am not sure whether this training code is correct and which parameters I should pass to xgboost. The problem is that when I run this training and then make predictions on the test data set, my classifier puts almost all new instances into a single class (which cannot be right, because other models I have used on the same data give accurate classification rates).
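For reference, this is the pattern I would expect to be correct, though confirming it is part of my question (dtest is an assumed name for an xgb.DMatrix built from the test features; it is not defined above):

# A minimal sketch of my assumed training/prediction pattern (not verified)
set.seed(best_seednumber)                       # reuse the seed found during tuning
best_model <- xgb.train(params = best_param,
                        data = dtrain,
                        nrounds = best_acc_index)   # rounds selected by early stopping

pred_prob  <- predict(best_model, dtest)        # binary:logistic returns probabilities
pred_class <- as.numeric(pred_prob > 0.5)       # threshold at 0.5 for class labels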

So, to sum up, my questions are:

  1. How can I use the training parameters obtained from the cross-validation phase in the training function of the xgboost package?

  2. Since I am fairly new to this field, can you confirm that I should pre-process my test data set in the same way as my training data set (transformations, feature engineering, and so on)? See the sketch after this list.
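
As an example of what I mean in question 2 (a hypothetical illustration; train_features and test_features are made-up names for numeric feature matrices): if I scale the features, I believe the centering and scaling statistics must be computed on the training set only and then applied unchanged to the test set.

# Hypothetical illustration: scaling statistics come from the training set only
train_means <- colMeans(train_features)
train_sds   <- apply(train_features, 2, sd)

train_scaled <- scale(train_features, center = train_means, scale = train_sds)
test_scaled  <- scale(test_features,  center = train_means, scale = train_sds)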

I know that my code is not reproducible, but I am more interested in the use of the functions, so I guess this is not crucial at this stage.

Thank you.


Solution

  • In the end, it was an error in the definition of my test data set that caused the problem. There is nothing wrong with the way I defined the parameters of the training model.
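
    For anyone hitting the same symptom (almost all predictions landing in one class): one typical way to get the test set definition wrong is a column mismatch, since the test matrix must contain the same columns, in the same order, as the training matrix. A hypothetical illustration (not my exact bug; train_matrix and test_matrix are made-up names):

    # Reorder the test columns to match the training matrix before predicting
    test_matrix <- test_matrix[, colnames(train_matrix), drop = FALSE]
    stopifnot(identical(colnames(train_matrix), colnames(test_matrix)))

    dtest <- xgb.DMatrix(data = test_matrix)
    pred  <- predict(best_model, dtest)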