Let's say we tune an SVM with GridSearchCV like this:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

algorithm = SVC()
parameters = {'kernel': ['rbf', 'sigmoid'], 'C': [0.1, 1, 10]}
grid = GridSearchCV(algorithm, parameters)
grid.fit(X, y)
You then want to use the best-fit parameters/estimator in a cross_val_score. My question is: which model is grid at this point? Is it the best-performing one? In other words, can we just do
cross_val_scores = cross_val_score(grid, X=X, y=y)
or should we use
cross_val_scores = cross_val_score(grid.best_estimator_, X=X, y=y)
When I run both, they do not return the same scores, so I am curious which is the correct approach here. (I would assume best_estimator_.) That raises another question, though: if I pass just grid, what model does it actually use? The first one?
You don't need cross_val_score after fitting a GridSearchCV. It already has attributes that give you access to the cross-validation scores: cv_results_ holds all of them. You can index into it with the best_index_ attribute if you only want that specific estimator's results.
import pandas as pd

cv_results = pd.DataFrame(grid.cv_results_)
cv_results.iloc[grid.best_index_]
mean_fit_time 0.00046916
std_fit_time 1.3785e-05
mean_score_time 0.000251055
std_score_time 1.19038e-05
param_C 10
param_kernel rbf
params {'C': 10, 'kernel': 'rbf'}
split0_test_score 0.966667
split1_test_score 1
split2_test_score 0.966667
split3_test_score 0.966667
split4_test_score 1
mean_test_score 0.98
std_test_score 0.0163299
rank_test_score 1
Name: 5, dtype: object
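If you only need the headline numbers, the fitted search also exposes them directly; a minimal sketch, reusing the grid fitted above:
grid.best_params_   # {'C': 10, 'kernel': 'rbf'}
grid.best_score_    # mean_test_score of the best candidate, 0.98 here
# best_score_ is exactly the best row of mean_test_score:
grid.cv_results_['mean_test_score'][grid.best_index_]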
Most of the methods you call on a fitted GridSearchCV use the best model (grid.predict(...) gets you the predictions from the best model, for example). This is not true for cross_val_score: it clones and refits whatever estimator you pass it. Passing grid re-runs the entire grid search inside each fold, so the scoring is done against a freshly fitted search rather than against your already-fitted grid.best_estimator_. That is most likely where the difference you see comes from.
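To make the distinction concrete, here is a minimal sketch (assuming the X, y, and fitted grid from above):
import numpy as np
from sklearn.model_selection import cross_val_score

# Methods on the fitted search delegate to the best model:
assert np.array_equal(grid.predict(X), grid.best_estimator_.predict(X))

# Passing the search itself clones it, so the whole grid search is
# re-run inside each fold (nested cross-validation):
nested_scores = cross_val_score(grid, X=X, y=y)

# Passing best_estimator_ clones an SVC that already carries the chosen
# hyperparameters; only that single model is fit per fold:
tuned_scores = cross_val_score(grid.best_estimator_, X=X, y=y)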