I'm trying to find the best set of hyperparameters for my GradientBoostingRegressor with GridSearchCV, but I'm having trouble getting the performance of the best model.
My code is as follows; the function is expected to return an optimized model.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

def parameter_tuning_Gradient_Boost(X, y):
    model = GradientBoostingRegressor()
    param_grid = {
        'learning_rate': [0.005, 0.01, 0.02, 0.05, 0.1],
        'subsample': [1.0, 0.8, 0.6],
        'n_estimators': [100, 200, 500, 1000],
        'max_depth': [2, 4, 6, 8, 10],
    }
    grid_search = GridSearchCV(model,
                               param_grid,
                               cv=5,
                               n_jobs=8,
                               verbose=0)
    grid_search.fit(X=X, y=y)
    print('Best Parameters by Searching: %s' % grid_search.best_params_)

    # Rebuild a fresh (unfitted) model with the best hyperparameters found.
    best_parameters = grid_search.best_estimator_.get_params()
    model = GradientBoostingRegressor(learning_rate=best_parameters['learning_rate'],
                                      subsample=best_parameters['subsample'],
                                      n_estimators=best_parameters['n_estimators'],
                                      max_depth=best_parameters['max_depth'])
    return model
In general, I have the following questions:

1. Should I use the train_test_split function to split X and y, and then feed X_train and y_train to grid_search.fit? Some say GridSearchCV will automatically split the data into train and test sets if you set cv=5, but I have seen online tutorials do something like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
grid_search.fit(X_train, y_train)

2. How should I get the performance of the best model? Which of the following should I use:

print("Best Score:", grid_search.score(X, y))
print("Best Score: %.3f" % grid_search.best_score_)

In short, I am trying to use GridSearchCV to perform hyperparameter tuning for a regressor and to get its cross-validation performance. I want to know what the default evaluation metric is here, and whether I still have to split the data into train and test sets when I set the cv parameter of GridSearchCV.
Yes, GridSearchCV will split the data into 5 train/validation splits and use those splits to find the optimal hyperparameters. However, it is also good practice to set aside a completely unseen portion of the data that you score the final model on once you are completely done with training; take a look at this article to read more on this. Remember: after evaluating the model on the unseen test set, you are not "allowed" to go back and improve your model.
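As a rough sketch of that workflow (the toy data, the test_size of 0.2 and the tiny parameter grid are just placeholders for illustration): hold out a test set first, run the grid search on the training portion only, and score the refitted best estimator on the held-out data at the very end.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data just for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=0)

# Hold out a final test set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid_search = GridSearchCV(GradientBoostingRegressor(),
                           param_grid={'learning_rate': [0.01, 0.1],
                                       'n_estimators': [100, 200]},
                           cv=5)
grid_search.fit(X_train, y_train)   # the internal 5-fold CV happens here

# Cross-validated score of the best hyperparameters (computed on the training folds).
print("CV best score:", grid_search.best_score_)

# Final, one-off evaluation on the untouched test set.
print("Held-out test score:", grid_search.score(X_test, y_test))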
The scoring metric for GridSearchCV can be user-defined through its scoring parameter. By default it falls back to the estimator's own score method, which for GradientBoostingRegressor (as for all scikit-learn regressors) is the R² (coefficient of determination) score. Note that this is not the same thing as the loss parameter of GradientBoostingRegressor (loss='squared_error' by default), which is the loss minimized during training, not the metric used to rank the candidates in the grid search.
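If you would rather rank the candidates by mean squared error than by the default R², you can pass one of scikit-learn's built-in scorer strings explicitly. A minimal sketch (the parameter grid here is just a placeholder):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    GradientBoostingRegressor(),
    param_grid={'learning_rate': [0.01, 0.1], 'max_depth': [2, 4]},
    cv=5,
    scoring='neg_mean_squared_error',  # negated so that higher is still better
)
# After fitting, best_score_ will be a (negative) MSE instead of an R² value.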
To answer the second part of the question, we need to understand what happens in GridSearchCV when it fits. When the optimal hyperparameters are found, a model is refitted on all of the data with those hyperparameters, provided refit=True, which it is by default.
grid_search.score(X, y) scores that refitted model on whatever data you pass to it (here, all of the data), while grid_search.best_score_ returns the average validation score of the best hyperparameter combination over the five splits.
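To see where best_score_ comes from, you can inspect cv_results_, which stores the per-split validation scores for every candidate. A small sketch (toy data and a reduced grid, purely for illustration; the cv_results_ keys are the real ones):

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=0.5, random_state=0)
grid_search = GridSearchCV(GradientBoostingRegressor(),
                           param_grid={'learning_rate': [0.01, 0.1]},
                           cv=5).fit(X, y)

results = pd.DataFrame(grid_search.cv_results_)
best = grid_search.best_index_   # row of the best hyperparameter combination

# The five per-fold validation scores of the best candidate...
print(results.loc[best, [f'split{i}_test_score' for i in range(5)]])

# ...whose mean is exactly grid_search.best_score_.
print(results.loc[best, 'mean_test_score'], grid_search.best_score_)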