I thought f1_macro for multiclass in scikit-learn would be computed as:
2 * Macro_precision * Macro_recall / (Macro_precision + Macro_recall)
But a manual check showed otherwise: my value was slightly higher than what scikit-learn computed, and I went through the documentation without finding a formula.
For instance, the iris data set yields this:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.model_selection import train_test_split

# load the iris data into a DataFrame
iris = datasets.load_iris()
data = pd.DataFrame({
    'sepal length': iris.data[:, 0],
    'sepal width': iris.data[:, 1],
    'petal length': iris.data[:, 2],
    'petal width': iris.data[:, 3],
    'species': iris.target
})
X = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y = data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# fit a random forest and predict on the held-out set
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Compute metrics with scikit-learn
from sklearn import metrics
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
pre_macro = metrics.precision_score(y_test, y_pred, average="macro")
recall_macro = metrics.recall_score(y_test, y_pred, average="macro")
f1_macro_scikit = metrics.f1_score(y_test, y_pred, average="macro")
print("Prec_macro_scikit:", pre_macro)
print("Rec_macro_scikit:", recall_macro)
print("f1_macro_scikit:", f1_macro_scikit)
Output:
Prec_macro_scikit: 0.9555555555555556
Rec_macro_scikit: 0.9666666666666667
f1_macro_scikit: 0.9586466165413534
However, a manual computation using:
f1_macro_manual = 2 * pre_macro * recall_macro / (pre_macro + recall_macro)
yields:
f1_macro_manual: 0.9610789980732178
I'm trying to figure out where this disparity comes from.
Macro-averaging doesn't work like that. The macro-averaged f1 score is not computed from the macro-averaged precision and recall values.
Macro-averaging computes the value of a metric for each class separately and returns an unweighted average of those per-class values. Thus, calling f1_score with average='macro' computes an f1 score for each class and returns the mean of those scores.
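As an illustration, here is a minimal sketch of what average='macro' effectively does: a harmonic mean per class first, then an unweighted average across classes. It reuses y_test, y_pred and the metrics import from the question and assumes no class ends up with precision + recall of zero (scikit-learn itself handles that edge case by reporting 0 for the class).
# per-class precision and recall (one value per class, one-vs-rest)
prec_per_class = metrics.precision_score(y_test, y_pred, average=None)
rec_per_class = metrics.recall_score(y_test, y_pred, average=None)

# harmonic mean of precision and recall within each class ...
f1_per_class = 2 * prec_per_class * rec_per_class / (prec_per_class + rec_per_class)

# ... then an unweighted average across classes
print(f1_per_class.mean())
print(metrics.f1_score(y_test, y_pred, average="macro"))  # same value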
If you just want the macro-averaged value without the intermediate steps, pass average=None to get an array with one f1 score per class (each computed one-vs-rest), then take the mean() of that array:
binary_scores = metrics.f1_score(y_test, y_pred, average=None)  # one f1 score per class
manual_f1_macro = binary_scores.mean()  # unweighted mean == f1_score(average='macro')
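Putting the two quantities side by side makes the disparity from the question explicit: the average of per-class harmonic means is generally not the same as the harmonic mean of the macro-averaged precision and recall; they agree only in special cases (for example, when every class has identical precision and recall). Reusing pre_macro and recall_macro from the question (note the split above has no fixed random_state, so the exact numbers vary from run to run):
# mean of the per-class f1 scores -> scikit-learn's f1_macro (~0.9586 in the question's run)
print(manual_f1_macro)

# the formula from the question: harmonic mean of the macro-averaged
# precision and recall (~0.9611 in the question's run) -- a different, here larger, quantity
print(2 * pre_macro * recall_macro / (pre_macro + recall_macro))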