I am learning about multiclass classification with scikit-learn. My goal is to write code that includes all the metrics needed to evaluate the classification. This is my code:
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
param_grid = [
    {'estimator__randomforestclassifier__n_estimators': [3, 10], 'estimator__randomforestclassifier__max_features': [2]},
    # {'estimator__randomforestclassifier__bootstrap': [False], 'estimator__randomforestclassifier__n_estimators': [3, 10], 'estimator__randomforestclassifier__max_features': [2, 3, 4]}
]
rf_classifier = OneVsRestClassifier(
    make_pipeline(RandomForestClassifier(random_state=42))
)
scoring = {'accuracy': make_scorer(accuracy_score),
           'precision_macro': make_scorer(precision_score, average='macro'),
           'recall_macro': make_scorer(recall_score, average='macro'),
           'f1_macro': make_scorer(f1_score, average='macro'),
           'precision_micro': make_scorer(precision_score, average='micro'),
           'recall_micro': make_scorer(recall_score, average='micro'),
           'f1_micro': make_scorer(f1_score, average='micro'),
           'f1_weighted': make_scorer(f1_score, average='weighted')}
grid_search = GridSearchCV(rf_classifier, param_grid=param_grid, cv=2,
                           scoring=scoring, refit=False)
grid_search.fit(X_train_prepared, y_train)
However, when I try to find out the best estimator, I get the following error message:
print(grid_search.best_params_)
print(grid_search.best_estimator_)
AttributeError: 'GridSearchCV' object has no attribute 'best_params_'
Question: How is it possible that even after fitting the model I do not get the best estimator?
I noticed that if I set refit="some_of_the_metrics", I get an estimator, but I do not understand why I should use it, since it would fit the model to optimize a single metric instead of all of them.
Therefore, how can I get the best estimator for all the scores? And what is the point of refit?
Note: I tried to read the documentation but it still does not make sense to me.
The point of refit is that, after the search, the model is refitted on the entire dataset using the best parameter set found. To find the best parameters, cross-validation is used, which means the data is repeatedly split into training and validation folds, so the model is never trained on the whole dataset during the search itself.
When you define multiple metrics, you have to tell scikit-learn how to determine what best means for you. For convenience, you can simply name one of your scorers as the decider, so to speak. In that case, the parameter set that maximizes this metric is used for refitting.
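For example, a minimal sketch reusing the scoring dictionary and rf_classifier defined in the question:

grid_search = GridSearchCV(rf_classifier, param_grid=param_grid, cv=2,
                           scoring=scoring, refit='f1_macro')
grid_search.fit(X_train_prepared, y_train)
# both attributes exist now: the model was refitted on the whole training set
# using the parameter set that maximized the mean f1_macro across the CV splits
print(grid_search.best_params_)
print(grid_search.best_estimator_)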
If you want something more sophisticated, such as taking the parameter set with the highest mean across all scorers, you have to pass a function to refit. Given all the computed metrics, this function must return the index of the best parameter set, which is then used to refit the model.
These metrics are passed to that function as a dictionary with strings as keys and NumPy arrays as values (the same structure as cv_results_). Each of those arrays has one entry per evaluated parameter set. You find a lot of things in there; the most relevant entries are probably mean_test_<scorer-name>, which contain, for each tested parameter set, the mean of that scorer across the CV splits.
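To see what this dictionary looks like, you can inspect cv_results_ after any fit; the key names below follow from the scoring dictionary defined in the question:

results = grid_search.cv_results_
# one 'mean_test_<scorer-name>' entry per scorer, e.g. 'mean_test_accuracy', 'mean_test_f1_macro', ...
print([name for name in results if name.startswith('mean_test')])
# one mean f1_macro value per evaluated parameter set
print(results['mean_test_f1_macro'])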
In code, to get the index of the parameter set that has the highest mean across all scorers, you can do the following:
import numpy as np

def find_best_index(eval_results: dict[str, np.ndarray]) -> int:
    # stack the mean test scores: shape (n-scorers, n-parameter-sets)
    means_of_splits = np.array(
        [values for name, values in eval_results.items() if name.startswith('mean_test')]
    )
    # average over the scorers: one value per parameter set
    mean_of_all_scores = np.mean(means_of_splits, axis=0)
    # the index of the maximum value corresponds to the best parameter set
    return np.argmax(mean_of_all_scores)
grid_search = GridSearchCV(
    rf_classifier, param_grid=param_grid, cv=2, scoring=scoring, refit=find_best_index
)
grid_search.fit(X_train_prepared, y_train)
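After fitting with the callable, best_index_, best_params_ and best_estimator_ are available again (best_score_ is not set when refit is a function):

print(grid_search.best_params_)
# refitted on the whole training set with the parameter set chosen by find_best_index
print(grid_search.best_estimator_)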