I thought f1_macro for multiclass in scikit-learn would be computed as:
2 * Macro_precision * Macro_recall / (Macro_precision + Macro_recall)
But a manual check showed otherwise: my value was slightly higher than what scikit-learn computed, and I went through the documentation without finding a formula.
For instance, the iris data set yields this:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.model_selection import train_test_split

# load the iris data into a DataFrame
iris = datasets.load_iris()
data = pd.DataFrame({
    'sepal length': iris.data[:, 0],
    'sepal width': iris.data[:, 1],
    'petal length': iris.data[:, 2],
    'petal width': iris.data[:, 3],
    'species': iris.target
})
X = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y = data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# fit a random forest and predict on the held-out set
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Compute metrics with scikit-learn
from sklearn import metrics
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
pre_macro = metrics.precision_score(y_test, y_pred, average="macro")
recall_macro = metrics.recall_score(y_test, y_pred, average="macro")
f1_macro_scikit = metrics.f1_score(y_test, y_pred, average="macro")
print("Prec_macro_scikit:", pre_macro)
print("Rec_macro_scikit:", recall_macro)
print("f1_macro_scikit:", f1_macro_scikit)
Output:
Prec_macro_scikit: 0.9555555555555556
Rec_macro_scikit: 0.9666666666666667
f1_macro_scikit: 0.9586466165413534
However, a manual computation using:
f1_macro_manual = 2 * pre_macro * recall_macro / (pre_macro + recall_macro)
yields:
f1_macro_manual: 0.9610789980732178
I'm trying to figure out where this disparity comes from.
Macro-averaging doesn't work like that. The macro-averaged f1 score is not computed from the macro-averaged precision and recall values.
Macro-averaging computes the value of a metric for each class separately and returns an unweighted average of those per-class values. Thus, calling f1_score with average='macro' computes an f1 score for each class and returns the mean of those scores.
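As an illustration, here is a minimal sketch of what average='macro' effectively does: a harmonic mean per class first, then an unweighted average across classes. It reuses y_test, y_pred and the metrics import from the question and assumes no class ends up with precision + recall of zero (scikit-learn itself handles that edge case by reporting 0 for the class).
# per-class precision and recall (one value per class, one-vs-rest)
prec_per_class = metrics.precision_score(y_test, y_pred, average=None)
rec_per_class = metrics.recall_score(y_test, y_pred, average=None)

# harmonic mean of precision and recall within each class ...
f1_per_class = 2 * prec_per_class * rec_per_class / (prec_per_class + rec_per_class)

# ... then an unweighted average across classes
print(f1_per_class.mean())
print(metrics.f1_score(y_test, y_pred, average="macro"))  # same value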
If you just want the macro-averaged value without the intermediate steps, pass average=None to get an array with one f1 score per class (each computed one-vs-rest), then take the mean() of that array:
binary_scores = metrics.f1_score(y_test, y_pred, average=None)  # one f1 score per class
manual_f1_macro = binary_scores.mean()  # unweighted mean == f1_score(average='macro')
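Putting the two quantities side by side makes the disparity from the question explicit: the average of per-class harmonic means is generally not the same as the harmonic mean of the macro-averaged precision and recall; they agree only in special cases (for example, when every class has identical precision and recall). Reusing pre_macro and recall_macro from the question (note the split above has no fixed random_state, so the exact numbers vary from run to run):
# mean of the per-class f1 scores -> scikit-learn's f1_macro (~0.9586 in the question's run)
print(manual_f1_macro)

# the formula from the question: harmonic mean of the macro-averaged
# precision and recall (~0.9611 in the question's run) -- a different, here larger, quantity
print(2 * pre_macro * recall_macro / (pre_macro + recall_macro))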