I have a question about cross-validating, tuning, training, and predicting with a model when using the xgboost package and its xgb.cv function in R.
In particular, I have reused and adapted code from the internet to search the parameter space for the best parameters (tuning) using xgb.cv in a classification problem.
Here is the code used to perform this task:
# *****************************
# ******* TUNING ************
# *****************************
start_time <- Sys.time()
best_param <- list()
best_seednumber <- 1234
best_acc <- 0
best_acc_index <- 0
set.seed(1234)
# In reality, might need 100 or 200 iters
for (iter in 1:200) {
  param <- list(objective = "binary:logistic",
                eval_metric = c("error"),            # rmse is used for regression
                max_depth = sample(6:10, 1),
                eta = runif(1, .01, .1),             # Learning rate, default: 0.3
                subsample = runif(1, .6, .9),
                colsample_bytree = runif(1, .5, .8),
                min_child_weight = sample(5:10, 1),  # These two are important
                max_delta_step = sample(5:10, 1)     # Can help to focus error
                                                     # into a small range.
  )
  cv.nround <- 1000
  cv.nfold <- 10                       # 10-fold cross-validation
  seed.number <- sample.int(10000, 1)  # set seed for the cv
  set.seed(seed.number)
  mdcv <- xgb.cv(data = dtrain, params = param,
                 nfold = cv.nfold, nrounds = cv.nround,
                 verbose = FALSE, early_stopping_rounds = 20, maximize = FALSE,
                 stratified = TRUE)
  max_acc_index <- mdcv$best_iteration
  max_acc <- 1 - mdcv$evaluation_log[mdcv$best_iteration]$test_error_mean
  print(iter)  # loop counter (the original printed an undefined `i`)
  print(max_acc)
  print(mdcv$evaluation_log[mdcv$best_iteration])
  # Keep the best-performing parameter set seen so far
  if (max_acc > best_acc) {
    best_acc <- max_acc
    best_acc_index <- max_acc_index
    best_seednumber <- seed.number
    best_param <- param
  }
}
end_time <- Sys.time()
print(end_time - start_time) # Duration -> 1.54796 hours
After about 1.5 hours this code gives me back the best performing parameters in the cross-validation setting. I am also able to reproduce the accuracy obtained in the loop and the best parameters.
# Reproduce what was found in the loop
set.seed(best_seednumber)
best_model_cv <- xgb.cv(data=dtrain, params=best_param, nfold=cv.nfold, nrounds=cv.nround,
verbose = T, early_stopping_rounds = 20, maximize = F, stratified = T,
prediction=TRUE)
print(best_model_cv)
best_model_cv$params
Now I want to use these "best parameters" to train a model on my full training set, using either xgboost or xgb.train, and make predictions on a test data set.
best_model <- xgboost(params = best_param, data=dtrain,
seed=best_seednumber, nrounds=10)
At this point, I am not sure whether this training code is correct or which parameters I should pass to xgboost. The problem is that when I run this training and then make predictions on the test data set, my classifier assigns almost all new instances to a single class. That cannot be right, because other models I have tried on the same data give accurate classification rates.
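For reference, here is a minimal sketch of how tuned parameters might be carried over to a final model with xgb.train. It runs on synthetic data, and the names `x_train`, `dtrain_demo`, `final_param`, and `final_nrounds` are hypothetical stand-ins for the real training data, `best_param`, and `best_acc_index` from the loop above. Two assumptions are worth stating: the xgboost parameter documentation notes that `seed` is ignored in the R package in favour of set.seed(), and the round count selected by early stopping in xgb.cv is a natural candidate for nrounds, though not the only reasonable choice.

```r
library(xgboost)

# Synthetic binary classification data (stand-in for the real training set).
set.seed(1234)
n <- 500
x_train <- matrix(rnorm(n * 5), ncol = 5,
                  dimnames = list(NULL, paste0("f", 1:5)))
y_train <- as.numeric(x_train[, 1] + x_train[, 2] + rnorm(n, sd = 0.5) > 0)
dtrain_demo <- xgb.DMatrix(data = x_train, label = y_train)

# Hypothetical stand-ins for best_param and best_acc_index from the tuning loop.
final_param <- list(objective = "binary:logistic",
                    eval_metric = "error",
                    max_depth = 6,
                    eta = 0.05,
                    subsample = 0.8,
                    colsample_bytree = 0.7)
final_nrounds <- 100

# The R package draws its randomness from R's RNG, so seed it with set.seed().
set.seed(1234)
final_model <- xgb.train(params = final_param, data = dtrain_demo,
                         nrounds = final_nrounds, verbose = 0)

# binary:logistic returns probabilities; threshold them to get class labels.
x_test <- matrix(rnorm(100 * 5), ncol = 5,
                 dimnames = list(NULL, colnames(x_train)))
pred_prob <- predict(final_model, xgb.DMatrix(data = x_test))
pred_class <- as.integer(pred_prob > 0.5)
table(pred_class)
```

If the table shows nearly all predictions in one class, comparing the distribution of `pred_prob` with the class balance in the training labels is a quick first diagnostic.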
So, to sum up, my questions are:
How can I use the training parameters obtained from the cross-validation phase in the training function of the package xgboost?
Since I am fairly new to this field, can you confirm that I should pre-process my test data set in the same way as my training data set (transformations, feature engineering and so on)?
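On the second question, the usual practice is to apply the same transformations to the test set, with any fitted quantities (means, standard deviations and so on) estimated on the training set only. A minimal base R sketch of that idea follows; the column names `income` and `age` and the helpers `fit_prep` and `apply_prep` are hypothetical and only illustrate the pattern.

```r
# Hypothetical raw data: one skewed feature and one feature to standardise.
set.seed(1)
train_raw <- data.frame(income = rlnorm(200, meanlog = 10),
                        age = rnorm(200, mean = 40, sd = 10))
test_raw  <- data.frame(income = rlnorm(50, meanlog = 10),
                        age = rnorm(50, mean = 40, sd = 10))

# Learn the preprocessing parameters from the training set only.
fit_prep <- function(df) {
  list(age_mean = mean(df$age), age_sd = sd(df$age))
}

# Apply the same transformations, with the stored parameters, to any data set.
apply_prep <- function(df, prep) {
  data.frame(log_income = log(df$income),
             age_z = (df$age - prep$age_mean) / prep$age_sd)
}

prep <- fit_prep(train_raw)
train_x <- as.matrix(apply_prep(train_raw, prep))
test_x  <- as.matrix(apply_prep(test_raw, prep))
```

Both matrices then share the same columns in the same order, which is what a model trained on `train_x` expects at prediction time.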
I know that my code is not reproducible, but I am mainly interested in how the function should be used, so I guess this is not crucial at this stage.
Thank you.
In the end, the problem was caused by an error in how I defined my test data set. There was nothing wrong with the way I defined the parameters of the training model.
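The exact error in the test set is not described here, so the following is only one kind of check that might catch such a problem: a sketch of a helper (the name `check_test_matrix` is hypothetical) that compares the test matrix with the training matrix column by column before predicting.

```r
# Hypothetical helper: fail early if the test matrix does not line up with
# the training matrix, and return the test matrix in training column order.
check_test_matrix <- function(train_x, test_x) {
  missing_cols <- setdiff(colnames(train_x), colnames(test_x))
  extra_cols   <- setdiff(colnames(test_x), colnames(train_x))
  if (length(missing_cols) > 0) {
    stop("Test set is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (length(extra_cols) > 0) {
    stop("Test set has unexpected columns: ", paste(extra_cols, collapse = ", "))
  }
  # Reorder so the columns follow the training order.
  test_x[, colnames(train_x), drop = FALSE]
}

# Toy matrices: same columns, different order.
train_x <- matrix(0, nrow = 3, ncol = 3,
                  dimnames = list(NULL, c("a", "b", "c")))
test_x  <- matrix(1:6, nrow = 2, ncol = 3,
                  dimnames = list(NULL, c("c", "a", "b")))
aligned <- check_test_matrix(train_x, test_x)
colnames(aligned)
```

Running a check like this just before building the test xgb.DMatrix makes a mismatched or misordered feature set fail loudly instead of silently producing poor predictions.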