Tags: python, machine-learning, scikit-learn, xgboost

Tuning XGBoost Hyperparameters with RandomizedSearchCV


I'm trying to use XGBoost for a particular dataset that contains around 500,000 observations and 10 features. I'm trying to do some hyperparameter tuning with RandomizedSearchCV, and the model with the best parameters performs worse than the model with the default parameters.

Model with default parameters:

from xgboost import XGBRegressor

model = XGBRegressor()
model.fit(X_train, y_train["speed"])
y_predict_speed = model.predict(X_test)

from sklearn.metrics import r2_score
print("R2 score:", r2_score(y_test["speed"],y_predict_speed, multioutput='variance_weighted'))
R2 score: 0.3540656307310167

Best model from random search:

from sklearn.model_selection import RandomizedSearchCV

## Hyper Parameter Optimization
n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
booster = ['gbtree', 'gblinear']
learning_rate = [0.05, 0.1, 0.15, 0.20]
min_child_weight = [1, 2, 3, 4]
base_score = [0.25, 0.5, 0.75, 1]

# Define the grid of hyperparameters to search
hyperparameter_grid = {
    'n_estimators': n_estimators,
    'max_depth':max_depth,
    'learning_rate':learning_rate,
    'min_child_weight':min_child_weight,
    'booster':booster,
    'base_score':base_score
    }

# Set up the random search with 5-fold cross validation
regressor = XGBRegressor()
random_cv = RandomizedSearchCV(estimator=regressor,
            param_distributions=hyperparameter_grid,
            cv=5, n_iter=50,
            scoring = 'neg_mean_absolute_error',n_jobs = 4,
            verbose = 5, 
            return_train_score = True,
            random_state=42)

random_cv.fit(X_train,y_train["speed"])

random_cv.best_estimator_

XGBRegressor(base_score=0.5, booster='gblinear', colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=None, gamma=None,
             gpu_id=-1, importance_type='gain', interaction_constraints=None,
             learning_rate=0.15, max_delta_step=None, max_depth=15,
             min_child_weight=3, missing=nan, monotone_constraints=None,
             n_estimators=500, n_jobs=16, num_parallel_tree=None,
             random_state=0, reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
             subsample=None, tree_method=None, validate_parameters=1,
             verbosity=None)

Using the best model:

regressor = XGBRegressor(base_score=0.5, booster='gblinear', colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=None, gamma=None,
             gpu_id=-1, importance_type='gain', interaction_constraints=None,
             learning_rate=0.15, max_delta_step=None, max_depth=15,
             min_child_weight=3, monotone_constraints=None,
             n_estimators=500, n_jobs=16, num_parallel_tree=None,
             random_state=0, reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
             subsample=None, tree_method=None, validate_parameters=1,
             verbosity=None)

regressor.fit(X_train,y_train["speed"])
y_pred = regressor.predict(X_test)

from sklearn.metrics import r2_score
print("R2 score:", r2_score(y_test["speed"],y_pred, multioutput='variance_weighted'))

R2 score: 0.14258774171629718

As you can see, after 3 hours of running the randomized search the R2 score actually drops. If I change the booster from 'gblinear' to 'gbtree' the score goes up to 0.65, so why is the randomized search not working?

I'm also getting the following warning:

This may not be accurate due to some parameters are only used in language bindings but passed down to XGBoost core. Or some parameters are not used but slip through this verification. Please open an issue if you find above cases.

Does anyone have a suggestion regarding this hyperparameter tuning method?


Solution

  • As stated in the XGBoost Docs

    Parameter tuning is a dark art in machine learning, the optimal parameters of a model can depend on many scenarios.

    You asked for suggestions for your specific scenario, so here are some of mine.

    1. Drop the dimension booster from your hyperparameter search space. You probably want to go with the default booster 'gbtree'. If you are interested in the performance of a linear model, you could just try linear or ridge regression (a quick baseline is sketched below), but don't bother with it during your XGBoost parameter tuning.
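    For instance, a quick linear baseline could look like this (a minimal sketch, assuming the same X_train / y_train / X_test / y_test splits as in your question):

    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score

    # Ridge regression as a cheap linear baseline instead of booster='gblinear'
    ridge = Ridge(alpha=1.0)
    ridge.fit(X_train, y_train["speed"])
    print("Ridge R2:", r2_score(y_test["speed"], ridge.predict(X_test)))
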
    2. Drop the dimension base_score from your hyperparameter search space. This should not have much of an effect with sufficiently many boosting iterations (see XGB parameter docs).
    3. Currently you have 3200 hyperparameter combinations in your grid. Expecting to find a good one by looking at 50 random ones might be a bit too optimistic. After dropping the booster and base_score dimensions you would be down to
    hyperparameter_grid = {
        'n_estimators': [100, 500, 900, 1100, 1500],
        'max_depth': [2, 3, 5, 10, 15],
        'learning_rate': [0.05, 0.1, 0.15, 0.20],
        'min_child_weight': [1, 2, 3, 4]
        }
    

    which has 400 possible combinations. For a first shot, I would simplify this a bit more. For example, you could try something like

    hyperparameter_grid = {
        'n_estimators': [100, 400, 800],
        'max_depth': [3, 6, 9],
        'learning_rate': [0.05, 0.1, 0.20],
        'min_child_weight': [1, 10, 100]
        }
    

    There are only 81 combinations left, and some of the very expensive combinations (e.g. 1500 trees of depth 15) are removed. Of course I don't know your data, so maybe it is necessary to consider such large / complex models. For a regression task with squared loss, min_child_weight is just the number of instances in a child (again, see the XGB parameter docs). Since you have 500,000 observations, it will probably not make (much of) a difference whether 1, 2, 3 or 4 observations end up in a leaf. Hence, I am suggesting [1, 10, 100] here. Maybe the random search finds something better than the default parameters in this grid? A runnable version of this search is sketched right below.
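    Putting this together, a run with the reduced grid might look like the following (a minimal sketch, reusing your training data; n_iter=30 is an arbitrary choice on my part, not something from your question):

    from xgboost import XGBRegressor
    from sklearn.model_selection import RandomizedSearchCV

    hyperparameter_grid = {
        'n_estimators': [100, 400, 800],
        'max_depth': [3, 6, 9],
        'learning_rate': [0.05, 0.1, 0.20],
        'min_child_weight': [1, 10, 100]
        }

    # Default booster 'gbtree'; sample 30 of the 81 combinations
    random_cv = RandomizedSearchCV(estimator=XGBRegressor(),
                param_distributions=hyperparameter_grid,
                cv=5, n_iter=30,
                scoring='neg_mean_absolute_error', n_jobs=4,
                random_state=42)
    random_cv.fit(X_train, y_train["speed"])
    print(random_cv.best_params_)
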

    4. An alternative strategy could be: run cross validation for each combination of
    hyperparameter_grid = {
        'max_depth': [3, 6, 9],
        'min_child_weight': [1, 10, 100]
        }
    

    fixing the learning rate at some constant value (not too low, e.g. 0.15). For each setting, use early stopping to determine an appropriate number of trees. This is possible using the early_stopping_rounds parameter of the xgboost.cv method. Afterwards you know a good combination of max_depth and min_child_weight (e.g. how complex do the base learners need to be for the given problem?) and also a good number of trees for this combination and the fixed learning rate. Fine tuning could then involve doing another hyperparameter search "close to" the current (max_depth, min_child_weight) solution and/or reducing the learning rate while increasing the number of trees. A sketch of this loop follows below.
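    Here is a minimal sketch of that loop with the native xgboost.cv API. The objective, the 2000-round upper bound, nfold=5 and early_stopping_rounds=20 are assumptions of mine, not values from your question:

    import xgboost as xgb

    dtrain = xgb.DMatrix(X_train, label=y_train["speed"])

    for max_depth in [3, 6, 9]:
        for min_child_weight in [1, 10, 100]:
            params = {'max_depth': max_depth,
                      'min_child_weight': min_child_weight,
                      'eta': 0.15,                     # fixed learning rate
                      'objective': 'reg:squarederror'}
            cv_results = xgb.cv(params, dtrain,
                                num_boost_round=2000,  # generous upper bound
                                nfold=5,
                                metrics='mae',
                                early_stopping_rounds=20,
                                seed=42)
            # xgb.cv truncates the result at the best iteration, so the
            # length of the returned DataFrame is a good number of trees
            print(max_depth, min_child_weight, len(cv_results),
                  cv_results['test-mae-mean'].min())
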

    5. And lastly, as this answer is getting a bit long: there are other alternatives to a random search if an exhaustive grid search is too expensive. E.g. you could look at halving grid search (sketched below) and sequential model based optimization.
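    For example, scikit-learn's halving grid search could be a drop-in replacement for the random search (note that it is still marked experimental and must be enabled explicitly; the grid below just reuses the values suggested above):

    from sklearn.experimental import enable_halving_search_cv  # noqa
    from sklearn.model_selection import HalvingGridSearchCV
    from xgboost import XGBRegressor

    halving_cv = HalvingGridSearchCV(estimator=XGBRegressor(),
                 param_grid={'max_depth': [3, 6, 9],
                             'min_child_weight': [1, 10, 100],
                             'learning_rate': [0.05, 0.1, 0.20]},
                 factor=3,  # keep the best third of the candidates each round
                 scoring='neg_mean_absolute_error',
                 cv=5, random_state=42)
    halving_cv.fit(X_train, y_train["speed"])
    print(halving_cv.best_params_)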