I'm trying to find the best set of hyperparameters for my GradientBoostingRegressor with GridSearchCV, but I'm having trouble getting the performance of the best model.
My code is as follows; the function is expected to return an optimized model.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

def parameter_tuning_Gradient_Boost(X, y):
    model = GradientBoostingRegressor()
    param_grid = {
        'learning_rate': [0.005, 0.01, 0.02, 0.05, 0.1],
        'subsample': [1.0, 0.8, 0.6],
        'n_estimators': [100, 200, 500, 1000],
        'max_depth': [2, 4, 6, 8, 10],
    }
    grid_search = GridSearchCV(model,
                               param_grid,
                               cv=5,
                               n_jobs=8,
                               verbose=0)
    grid_search.fit(X=X, y=y)
    print('Best Parameters by Searching: %s' % grid_search.best_params_)

    # Rebuild a fresh (unfitted) model with the best hyperparameters found.
    best_parameters = grid_search.best_estimator_.get_params()
    model = GradientBoostingRegressor(learning_rate=best_parameters['learning_rate'],
                                      subsample=best_parameters['subsample'],
                                      n_estimators=best_parameters['n_estimators'],
                                      max_depth=best_parameters['max_depth'])
    return model
In general, I have the following questions:

1. Should I use the train_test_split function to split X and y, and then feed X_train and y_train to grid_search.fit? Some say GridSearchCV will automatically split the data into train and test sets if you set cv=5, but I have seen online tutorials do something like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
grid_search.fit(X_train, y_train)

2. How should I get the performance of the best model? Which of the following should I use:

print("Best Score:", grid_search.score(X, y))
print("Best Score: %.3f" % grid_search.best_score_)

In short, I am trying to use GridSearchCV to perform hyperparameter tuning for a regressor and to get its cross-validation performance. I want to know what the default evaluation metric is here, and whether I still have to split the data into train and test sets when I set the cv parameter of GridSearchCV.
Yes, GridSearchCV will split the data into 5 train/validation splits and use those splits to find the optimal hyperparameters. However, it is also good practice to set aside a completely unseen portion of the data that you score the final model on once you are completely done with training; take a look at this article to read more on this. Remember: after evaluating the model on the unseen test set, you are not "allowed" to go back and improve your model.
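As a rough sketch of that workflow (the toy data, the test_size of 0.2 and the tiny parameter grid are just placeholders for illustration): hold out a test set first, run the grid search on the training portion only, and score the refitted best estimator on the held-out data at the very end.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data just for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=0)

# Hold out a final test set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid_search = GridSearchCV(GradientBoostingRegressor(),
                           param_grid={'learning_rate': [0.01, 0.1],
                                       'n_estimators': [100, 200]},
                           cv=5)
grid_search.fit(X_train, y_train)   # the internal 5-fold CV happens here

# Cross-validated score of the best hyperparameters (computed on the training folds).
print("CV best score:", grid_search.best_score_)

# Final, one-off evaluation on the untouched test set.
print("Held-out test score:", grid_search.score(X_test, y_test))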
The scoring metric for GridSearchCV can be user-defined through its scoring parameter. By default it falls back to the estimator's own score method, which for GradientBoostingRegressor (as for all scikit-learn regressors) is the R² (coefficient of determination) score. Note that this is not the same thing as the loss parameter of GradientBoostingRegressor (loss='squared_error' by default), which is the loss minimized during training, not the metric used to rank the candidates in the grid search.
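If you would rather rank the candidates by mean squared error than by the default R², you can pass one of scikit-learn's built-in scorer strings explicitly. A minimal sketch (the parameter grid here is just a placeholder):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    GradientBoostingRegressor(),
    param_grid={'learning_rate': [0.01, 0.1], 'max_depth': [2, 4]},
    cv=5,
    scoring='neg_mean_squared_error',  # negated so that higher is still better
)
# After fitting, best_score_ will be a (negative) MSE instead of an R² value.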
To answer the second part of the question, we need to understand what happens in GridSearchCV when it fits. When the optimal hyperparameters are found, a model is refitted on all of the data with those hyperparameters, provided refit=True, which it is by default.
grid_search.score(X, y) scores that refitted model on whatever data you pass to it (here, all of the data), while grid_search.best_score_ returns the average validation score of the best hyperparameter combination over the five splits.
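To see where best_score_ comes from, you can inspect cv_results_, which stores the per-split validation scores for every candidate. A small sketch (toy data and a reduced grid, purely for illustration; the cv_results_ keys are the real ones):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=0.5, random_state=0)
grid_search = GridSearchCV(GradientBoostingRegressor(),
                           param_grid={'learning_rate': [0.01, 0.1]},
                           cv=5).fit(X, y)

results = pd.DataFrame(grid_search.cv_results_)
best = grid_search.best_index_   # row of the best hyperparameter combination

# The five per-fold validation scores of the best candidate...
print(results.loc[best, [f'split{i}_test_score' for i in range(5)]])

# ...whose mean is exactly grid_search.best_score_.
print(results.loc[best, 'mean_test_score'], grid_search.best_score_)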