I am working on a supervised machine learning problem and it is showing some curious behavior, so let me start from the beginning.
I have a function to which I pass different classifiers, their parameters, the training data and its labels:
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

def HT(targets, train_new, algorithm, parameters):
    # create the scorer
    scorer = make_scorer(f1_score)
    # create the grid search object with the parameters of the function
    grid_search = GridSearchCV(algorithm,
                               param_grid=parameters, scoring=scorer, cv=5)
    # fit the grid_search object to the data
    grid_search.fit(train_new, targets.ravel())
    # print the name of the classifier, the best score and best parameters
    print(algorithm.__class__.__name__)
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    # assign the best estimator to the pipeline variable
    pipeline = grid_search.best_estimator_
    # predict the results for the training set
    results = pipeline.predict(train_new).astype(int)
    print(results)
    return pipeline
To this function I pass parameters like:
clf_param.append({'C': np.array([0.001, 0.01, 0.1, 1, 10]),
                  'kernel': ['linear', 'rbf'],
                  'decision_function_shape': ['ovr']})
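For completeness, here is a minimal sketch of how the function might be called; the SVC classifier and the random toy data are just placeholders, not my real setup:

import numpy as np
from sklearn.svm import SVC

clf_param = []
clf_param.append({'C': np.array([0.001, 0.01, 0.1, 1, 10]),
                  'kernel': ['linear', 'rbf'],
                  'decision_function_shape': ['ovr']})

# placeholder training data: 50 samples, 4 features, binary labels
train_new = np.random.rand(50, 4)
targets = np.random.randint(0, 2, size=(50, 1))

pipeline = HT(targets, train_new, SVC(), clf_param[0])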
OK, so here is where things start to get strange. This function returns an f1_score, but it is different from the score I compute manually with the formula F1 = 2 * (precision * recall) / (precision + recall).
The differences are pretty big (0.68 compared with 0.89).
Am I doing something wrong in the function? Should the score computed by the grid search (grid_search.best_score_) be the same as the score on the whole training set (grid_search.best_estimator_.predict(train_new))? Thanks
The score that you are calculating manually takes into account the global true positives and false negatives across all classes. In scikit-learn's f1_score, however, the default approach is the binary average, i.e. only the positive class is scored.
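A quick toy illustration of that difference (the labels below are made up, not the asker's data):

from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]

print(f1_score(y_true, y_pred))                   # default 'binary': positive class only -> 0.8
print(f1_score(y_true, y_pred, average='micro'))  # global TP/FP/FN over both classes     -> 0.75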
So, in order to achieve the same scores, use the f1_score as specified below:
scorer=make_scorer(f1_score, average='micro')
Or simply, in GridSearchCV, use:
scoring = 'f1_micro'
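As a rough sketch (the SVC and the small grid here are placeholders, not taken from the question), that would look like:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# same grid search as above, but with the built-in micro-averaged F1 scorer
grid_search = GridSearchCV(SVC(),
                           param_grid={'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
                           scoring='f1_micro', cv=5)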
More information about how the averaging of scores is done is given at http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
You may also want to take a look at the following answer, which describes the calculation of scores in scikit-learn in detail:
EDIT: Changed macro to micro. As written in the documentation:
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
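In other words, micro averaging applies exactly the formula you used by hand, just with the counts pooled over all classes. A small check with made-up multiclass labels (not your data):

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 1, 2, 0, 1, 2, 0, 2])
y_pred = np.array([0, 2, 2, 0, 1, 1, 0, 2])

# in single-label multiclass, each error is one FP (for the predicted class)
# and one FN (for the true class), so the pooled counts are:
tp = np.sum(y_true == y_pred)
fp = np.sum(y_true != y_pred)
fn = np.sum(y_true != y_pred)

precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
print(2 * precision * recall / (precision + recall))   # manual formula -> 0.75
print(f1_score(y_true, y_pred, average='micro'))       # same value     -> 0.75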