
Overfitting on the training data while still improving on the validation data


I am fitting a binary classification model with XGBoost in R. My dataset has 300k observations with 3 continuous predictors and one one-hot-encoded factor variable with 90 levels. The dependent variable y is TRUE or FALSE.

I have randomly sampled hyperparameter settings to find the optimal ones. For every setting I have done 5-fold (grouped) CV; a sketch of this search follows the table below. The hyperparameter settings below resulted in the highest average AUC on the 5 validation folds:

booster  objective        max_depth  eta        subsample  colsample_bytree   min_child_weight
gbtree   binary:logistic  8          0.7708479  0.2861735  0.5338721          1
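
Roughly, the search looked like the sketch below. Here dtrain stands for the full xgb.DMatrix and group_id for the grouping variable (both placeholders), and the search ranges are illustrative, not the exact ones I used:

library(xgboost)

# Assign whole groups to folds so the CV is grouped
set.seed(1)
fold_of_group <- sample(rep(1:5, length.out = length(unique(group_id))))
names(fold_of_group) <- unique(group_id)
folds <- split(seq_along(group_id), fold_of_group[as.character(group_id)])

# Randomly sample hyperparameter settings and score each by mean CV AUC
results <- do.call(rbind, lapply(1:25, function(i) {
  params <- list(booster = "gbtree", objective = "binary:logistic",
                 eval_metric = "auc",
                 max_depth = sample(3:10, 1),
                 eta = runif(1, 0.05, 0.8),
                 subsample = runif(1, 0.2, 1),
                 colsample_bytree = runif(1, 0.2, 1),
                 min_child_weight = sample(1:10, 1))
  cv <- xgb.cv(params = params, data = dtrain, nrounds = 500,
               folds = folds, early_stopping_rounds = 30, verbose = FALSE)
  data.frame(params, mean_val_auc = max(cv$evaluation_log$test_auc_mean))
}))
results[order(-results$mean_val_auc), ]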

Next I have used these hyperparameter settings in the xgb.train call below:

model_n <- xgb.train(data = xgb_trainval,
                     booster = "gbtree",
                     objective = "binary:logistic",
                     max_depth = 8,
                     eta = 0.7708479,
                     subsample = 0.2861735,
                     colsample_bytree = 0.5338721,
                     min_child_weight = 1,
                     nrounds = 1000,
                     eval_metric = "auc",
                     early_stopping_rounds = 30,
                     print_every_n = 100,
                     # val (last watchlist entry) drives early stopping
                     watchlist = list(train = xgb_trainval, val = xgb_val))
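
With early_stopping_rounds set, the last watchlist element (val here) drives the early stopping, and the returned booster records the best round:

model_n$best_iteration  # round with the highest validation AUC
model_n$best_score      # the validation AUC at that round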

I have visualized the evaluation log in this way:

model_iterations <- model_n$evaluation_log$iter
model_train_auc <- model_n$evaluation_log$train_auc
model_val_auc <- model_n$evaluation_log$val_auc
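
The two curves can then be overlaid, for example with base graphics:

plot(model_iterations, model_train_auc, type = "l", col = "blue",
     ylim = range(c(model_train_auc, model_val_auc)),
     xlab = "Iteration", ylab = "AUC")
lines(model_iterations, model_val_auc, col = "red")
legend("bottomright", legend = c("train", "val"), col = c("blue", "red"), lty = 1)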

(Plot: train AUC and validation AUC per boosting iteration.)

I conclude that the model is overfitting on the training data, since the training AUC gets close to 1 after 200 iterations. At the same time the model is still improving on the validation data. On the one hand I would conclude that the model after 500 iterations can't be a good model, since it is strongly overfitting on the training data. On the other hand, this model has the highest AUC on the validation data.

Can this model be optimal even though it is strongly overfitting on the training data as shown above, or should I tune further towards a model that overfits less on the training data (with a similar or even slightly lower AUC on the validation data)?

Thank you!


Solution

  • Yes, this is a viable strategy, but keep a final, unseen test set to confirm the choice (see the first sketch below).

    Also check, on all the data, that you are happy with the observations the model scores well on versus the observations it does not.

    Are you satisfied with the cases the model cannot handle?

    If not, train with weights, so that the important types of cases are handled well and the less important ones perhaps are not (second sketch below).
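
    A minimal sketch of the final check, assuming a held-out xgb_test / y_test (placeholders) and using the pROC package for the AUC:

    library(pROC)
    pred_test <- predict(model_n, xgb_test)
    auc(roc(y_test, pred_test))  # AUC on data never touched during tuning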
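
    And a sketch of weighted training, where X, y and the importance rule are placeholders for your data:

    w <- ifelse(important_case, 2, 1)  # hypothetical rule: upweight the cases that matter
    xgb_trainval <- xgb.DMatrix(data = X, label = y, weight = w)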