Tags: python, scikit-learn, pipeline, cross-validation, grid-search

Tuned 3 parameters using grid search but the best_estimator_ has only 2 parameters


I am tuning a gradient boosted classifier using a pipeline and grid search.

My pipeline is:

pipe = make_pipeline(
    StandardScaler(with_std=True, with_mean=True),
    RFE(RandomForestClassifier(), n_features_to_select=15),
    GradientBoostingClassifier(random_state=42, verbose=True),
)

The parameter grid is:

tuned_parameters = [{
    'gradientboostingclassifier__max_depth': range(3, 5),
    'gradientboostingclassifier__min_samples_split': range(4, 6),
    'gradientboostingclassifier__learning_rate': np.linspace(0.1, 1, 10),
}]

The grid search is done as:

grid = GridSearchCV(pipe, tuned_parameters, cv=5, scoring='accuracy', refit=True)
grid.fit(X_train, y_train)

After fitting the model on the training data, when I check grid.best_estimator_ I can only find two of the parameters (learning_rate and min_samples_split) that I am tuning. I don't find the max_depth parameter in the best estimator.

grid.best_estimator_.named_steps['gradientboostingclassifier']
# GradientBoostingClassifier(learning_rate=0.9, min_samples_split=5,
#                            random_state=42, verbose=True)

But if I use grid.cv_results_ to find the best mean_test_score and then look up the corresponding parameters for that score, I do find max_depth in them.

inde = np.where(grid.cv_results_['mean_test_score'] == max(grid.cv_results_['mean_test_score']))
grid.cv_results_['params'][inde[-1][0]]
# {'gradientboostingclassifier__learning_rate': 0.9,
#  'gradientboostingclassifier__max_depth': 3,
#  'gradientboostingclassifier__min_samples_split': 5}

My question now is: if I use the trained pipeline (the object named grid in my case) for prediction, will it still use the max_depth parameter or not? Or is it better to use the best parameters, taken from grid.cv_results_, that gave me the best mean_test_score?


Solution

  • Your pipeline has been tuned on all three parameters that you specified. It is just that the best value for max_depth happens to be the default value. When the classifier is printed, parameters left at their default values are not shown. Compare the following outputs:

    print(GradientBoostingClassifier(max_depth=3)) # default
    # output: GradientBoostingClassifier()
    
    print(GradientBoostingClassifier(max_depth=5)) # not default
    # output: GradientBoostingClassifier(max_depth=5)
    
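    If you want to double-check that the fitted pipeline really carries the tuned max_depth, you can read it straight off the best estimator, e.g. with get_params(), which always returns every parameter, defaults included. A minimal sketch using the grid object and step name from your code:

    gbc = grid.best_estimator_.named_steps['gradientboostingclassifier']
    print(gbc.get_params()['max_depth'])  # the tuned value is set, even though the repr hides it
    print(gbc.max_depth)                  # same value, available as a plain attribute
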

    In general, it is best practice to access the best parameters via the best_params_ attribute of the fitted GridSearchCV object, since this will always include all tuned parameters:

    grid.best_params_
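
    Based on the cv_results_ output shown in the question, this should list all three tuned parameters, e.g.:

    # {'gradientboostingclassifier__learning_rate': 0.9,
    #  'gradientboostingclassifier__max_depth': 3,
    #  'gradientboostingclassifier__min_samples_split': 5}

    And since you fitted with refit=True, calling grid.predict(...) delegates to the refit best_estimator_, so the tuned max_depth is used there as well; there is no need to re-extract the parameters from cv_results_.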