
Training and predicting with XGBoost in R


I have a question related to the cross-validation, tuning, training, and prediction steps of a model when using the xgboost package and the function xgb.cv in R.

In particular, I have reused and adapted code from the internet to search the parameter space for the best parameters (tuning) with xgb.cv on a classification problem.

Here is the code used to perform this task:

# *****************************
# *******  TUNING  ************
# *****************************
library(xgboost)  # dtrain below is assumed to be an xgb.DMatrix of the training data

start_time <- Sys.time()

best_param <- list()
best_seednumber <- 1234
best_acc <- 0
best_acc_index <- 0

set.seed(1234)
# In reality, might need 100 or 200 iters
for (iter in 1:200) {
  param <- list(objective = "binary:logistic",
                eval_metric = c("error"),      # rmse is used for regression
                max_depth = sample(6:10, 1),
                eta = runif(1, .01, .1),   # Learning rate, default: 0.3
                subsample = runif(1, .6, .9),
                colsample_bytree = runif(1, .5, .8), 
                min_child_weight = sample(5:10, 1), # These two are important
                max_delta_step = sample(5:10, 1) # Can help to focus error
                # into a small range.
  )
  cv.nround <-  1000
  cv.nfold <-  10 # 10-fold cross-validation
  seed.number  <-  sample.int(10000, 1) # set seed for the cv
  set.seed(seed.number)
  mdcv <- xgb.cv(data = dtrain, params = param,  
                 nfold = cv.nfold, nrounds = cv.nround,
                 verbose = FALSE, early_stopping_rounds = 20, maximize = FALSE,
                 stratified = TRUE)

  max_acc_index  <-  mdcv$best_iteration
  max_acc <- 1 - mdcv$evaluation_log[mdcv$best_iteration]$test_error_mean
  print(iter)
  print(max_acc)
  print(mdcv$evaluation_log[mdcv$best_iteration])

  if (max_acc > best_acc) {
    best_acc <- max_acc
    best_acc_index <- max_acc_index
    best_seednumber <- seed.number
    best_param <- param
  }
}

end_time <- Sys.time()

print(end_time - start_time)    # Duration -> 1.54796 hours

After about 1.5 hours this code gives me back the best-performing parameters found in the cross-validation setting. I am also able to reproduce the accuracy obtained in the loop and the best parameters:

# Reproduce what found in loop
set.seed(best_seednumber)
best_model_cv <- xgb.cv(data = dtrain, params = best_param,
                        nfold = cv.nfold, nrounds = cv.nround,
                        verbose = TRUE, early_stopping_rounds = 20,
                        maximize = FALSE, stratified = TRUE,
                        prediction = TRUE)
print(best_model_cv)
best_model_cv$params

Now I want to use these "best parameters" to train a model on my full training set using either xgboost or xgb.train, and then make predictions on a test data set.

best_model <- xgboost(params = best_param, data = dtrain,
                      seed = best_seednumber, nrounds = 10)

At this point, I am not sure whether this training code is correct and which parameters I should pass to xgboost. The problem is that when I run this training and then make predictions on the test data set, my classifier puts almost all new instances into a single class (which cannot be right, because other models I have used on the same data give accurate classification rates).
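For reference, this is the pattern I would expect to be correct, though confirming it is part of my question (dtest is an assumed name for an xgb.DMatrix built from the test features; it is not defined above):

# A minimal sketch of my assumed training/prediction pattern (not verified)
set.seed(best_seednumber)                       # reuse the seed found during tuning
best_model <- xgb.train(params = best_param,
                        data = dtrain,
                        nrounds = best_acc_index)   # rounds selected by early stopping

pred_prob  <- predict(best_model, dtest)        # binary:logistic returns probabilities
pred_class <- as.numeric(pred_prob > 0.5)       # threshold at 0.5 for class labels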

So, to sum up, my questions are:

  1. How can I use the training parameters obtained from the cross-validation phase in the training function of the xgboost package?

  2. Since I am fairly new to this field, can you confirm that I should pre-process my test data set in the same way as my training data set (transformations, feature engineering, and so on)? See the sketch after this list.
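
As an example of what I mean in question 2 (a hypothetical illustration; train_features and test_features are made-up names for numeric feature matrices): if I scale the features, I believe the centering and scaling statistics must be computed on the training set only and then applied unchanged to the test set.

# Hypothetical illustration: scaling statistics come from the training set only
train_means <- colMeans(train_features)
train_sds   <- apply(train_features, 2, sd)

train_scaled <- scale(train_features, center = train_means, scale = train_sds)
test_scaled  <- scale(test_features,  center = train_means, scale = train_sds)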

I know that my code is not reproducible, but I am more interested in the use of the functions, so I guess this is not crucial at this stage.

Thank you.


Solution

  • In the end, it was an error in the definition of my test data set that caused the problem. There is nothing wrong with the way I defined the parameters of the training model.
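
    For anyone hitting the same symptom (almost all predictions landing in one class): one typical way to get the test set definition wrong is a column mismatch, since the test matrix must contain the same columns, in the same order, as the training matrix. A hypothetical illustration (not my exact bug; train_matrix and test_matrix are made-up names):

    # Reorder the test columns to match the training matrix before predicting
    test_matrix <- test_matrix[, colnames(train_matrix), drop = FALSE]
    stopifnot(identical(colnames(train_matrix), colnames(test_matrix)))

    dtest <- xgb.DMatrix(data = test_matrix)
    pred  <- predict(best_model, dtest)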