
Why does precision_recall_curve() return different values than the classification report?


I have written the following code to calculate the precision and the recall for a multiclass classification problem:

import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle

from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score

def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return idx

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(
    svm.SVC(kernel="linear", probability=True, random_state=random_state)
)
classifier.fit(X_train, y_train)
y_score = classifier.decision_function(X_test)

# Classification report (per-class precision/recall from the hard predictions)
from sklearn.metrics import classification_report
y_test_pred = classifier.predict(X_test)
print(classification_report(y_test, y_test_pred))

# Compute the precision-recall curve for each class and evaluate it near threshold 0
precision = dict()
recall = dict()
threshold = dict()
for i in range(n_classes):
    c = classifier.classes_[i]
    precision[c], recall[c], threshold[c] = precision_recall_curve(y_test[:, c], y_score[:, c])
    th0 = find_nearest(threshold[c], 0)
    print(c, round(precision[c][th0],2), round(recall[c][th0], 2))

What I am trying to do is to re-calculate the per-class precision and recall shown in the classification report

precision    recall  f1-score   support

           0       0.73      0.52      0.61        21
           1       1.00      0.07      0.12        30
           2       0.57      0.33      0.42        24

   micro avg       0.68      0.28      0.40        75
   macro avg       0.77      0.31      0.39        75
weighted avg       0.79      0.28      0.36        75
 samples avg       0.28      0.28      0.28        75

by using the precision_recall_curve() function. In theory, it should return exactly the same values as the report when the threshold equals 0. However, my results do not match:

  precision recall
0     0.75   0.57
1      1.0    0.1
2      0.6   0.38

Could you explain this difference and how to properly reproduce the values of the classification report?
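
For reference, a minimal sketch of how the per-class numbers in the report can be reproduced directly from the hard predictions, reusing y_test and y_test_pred from the code above (precision_score and recall_score with average=None return one value per label):

from sklearn.metrics import precision_score, recall_score

# Column-wise precision/recall on the binarized labels; these should match
# the per-class rows (0, 1, 2) of the classification_report output above.
print(precision_score(y_test, y_test_pred, average=None))
print(recall_score(y_test, y_test_pred, average=None))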


Solution

  • As I wrote in the comment, using index th0 + 1 rather than index th0 will solve the problem in your case. However, that might just be a coincidence of this example (here the thresholds closest to 0 all happen to correspond to negative scores); therefore, for a programmatic approach, in my opinion you should modify find_nearest so that it returns the index of the threshold that is positive and closest to 0. Indeed, you can see what is going on by adding

    print(th0, threshold[c][th0-1], threshold[c][th0], threshold[c][th0+1])
    

    you'll get the following output:

    20 -0.011161920989200713 -0.01053513227868108 0.016453546101096173
    67 -0.04226738229343663 -0.0074193008862454835 0.09194626401603534
    38 -0.011860865951094923 -0.003756310149749531 0.0076752136658660985
    

    For a more programmatic approach, you can modify find_nearest as follows (naively masking out the non-positive thresholds) and keep using index th0 within your loop.

    def find_nearest_new(array, value):
        # Replace non-positive thresholds with a large sentinel (999) so that
        # argmin can only pick the positive threshold closest to `value`.
        array = np.asarray(array)
        idx = (np.abs(np.where(array > 0, array, 999) - value)).argmin()
        return idx
    ...
    for i in range(n_classes):
        c = classifier.classes_[i]
        precision[c], recall[c], threshold[c] = precision_recall_curve(y_test[:, c], y_score[:, c])
        th0 = find_nearest_new(threshold[c], 0)
        print(c, round(precision[c][th0],6), round(recall[c][th0], 6), round(threshold[c][th0],6))
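
    As a side note, the hard-coded 999 sentinel is only safe because the decision scores here are small in magnitude; below is a sketch of a variant (the name find_nearest_positive is mine) that masks with np.inf instead, so it does not depend on the scale of the scores:

    def find_nearest_positive(array, value=0):
        # Mask non-positive thresholds with +inf instead of a magic number,
        # so argmin can only select a strictly positive threshold closest to `value`.
        array = np.asarray(array, dtype=float)
        masked = np.where(array > 0, array, np.inf)
        return int(np.abs(masked - value).argmin())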
    

    The key point is the following: within the precision_recall_curve() implementation, precision and recall are defined as follows:

    precision: ndarray of shape (n_thresholds + 1,) Precision values such that element i is the precision of predictions with score >= thresholds[i] and the last element is 1.

    recall: ndarray of shape (n_thresholds + 1,) Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i] and the last element is 0.

    In other words, if you sort the scores in descending order (which is what the implementation does), you'll see that the selected thresholds (when you consider index th0 + 1) coincide with the smallest positive score of each class (indeed, the thresholds are nothing else than the distinct score values). On the other hand, if you stick to index th0 (in this specific example), the selected threshold is negative, so samples whose scores are strictly less than 0 are also counted as positive predictions, which is what the snippet below (and the manual check after it) illustrates.

    for i in range(n_classes):
        c = classifier.classes_[i]
        precision[c], recall[c], threshold[c] = precision_recall_curve(y_test[:, c], y_score[:, c])
        th0 = find_nearest(threshold[c], 0)
        print(c, round(precision[c][th0+1],6), round(recall[c][th0+1], 6), round(threshold[c][th0+1],6))
        #print(c, precision[c], recall[c], threshold[c])
        print(np.sort(y_score[:,c])[::-1])
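
    To double-check this, you can recompute precision and recall by hand with a strict score > 0 cut, which is (up to tie handling) what predict() applies per class in this one-vs-rest setup; a minimal sketch reusing y_test, y_score, classifier and n_classes from the question (the intermediate variable names are mine):

    for i in range(n_classes):
        c = classifier.classes_[i]
        y_true = y_test[:, c]
        # Predict "positive" for class c iff its decision score is strictly > 0
        y_pred_pos = y_score[:, c] > 0
        tp = np.sum(y_pred_pos & (y_true == 1))
        fp = np.sum(y_pred_pos & (y_true == 0))
        fn = np.sum(~y_pred_pos & (y_true == 1))
        prec = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        rec = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        # These should match the per-class rows of the classification_report above
        print(c, round(prec, 2), round(rec, 2))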
    

    This post might be of help to get a grasp of how things work within precision_recall_curve().