python · machine-learning · scikit-learn · yellowbrick

Scikit-learn and Yellowbrick giving different scores


I am using scikit-learn to compute the average precision and ROC AUC of a classifier, and Yellowbrick to plot the ROC and precision-recall curves. The problem is that the two packages give different scores for both metrics, and I do not know which one is correct.

The code used:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import PrecisionRecallCurve
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score

seed = 42

# generate the data
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_informative=2, random_state=seed)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

clf_lr = LogisticRegression(random_state=seed)
clf_lr.fit(X_train, y_train)

y_pred = clf_lr.predict(X_test)
roc_auc = roc_auc_score(y_test, y_pred)
avg_precision = average_precision_score(y_test, y_pred)
print(f"ROC_AUC: {roc_auc}")
print(f"Average_precision: {avg_precision}")
print('='*20)

# visualizations
viz3 = ROCAUC(LogisticRegression(random_state=seed))
viz3.fit(X_train, y_train) 
viz3.score(X_test, y_test)
viz3.show()
viz4 = PrecisionRecallCurve(LogisticRegression(random_state=seed))
viz4.fit(X_train, y_train)
viz4.score(X_test, y_test)
viz4.show()

The code produces the following output:

As can be seen, the metrics take different values depending on the package: the print statements show the values computed by scikit-learn, whereas the values computed by Yellowbrick are annotated on the plots.


Solution

  • Since you use the predict method of scikit-learn, your predictions y_pred are hard class memberships, and not probabilities:

    np.unique(y_pred)
    # array([0, 1])
    

    But for ROC and Precision-Recall calculations, this should not be the case; the predictions you pass to these methods should be probabilities, and not hard classes. From the average_precision_score docs:

    y_score: array, shape = [n_samples] or [n_samples, n_classes]

    Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).

    where non-thresholded means exactly not hard classes. The same applies to roc_auc_score (docs).
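
    As a side note, the decision_function alternative mentioned in the quoted docs works just as well here (a minimal sketch, assuming the same fitted clf_lr as above): logistic regression's decision_function returns non-thresholded scores that rank the samples exactly like the positive-class probabilities, so the rank-based ROC AUC and average precision come out the same.

    y_scores = clf_lr.decision_function(X_test)  # non-thresholded confidence scores
    roc_auc_score(y_test, y_scores)              # same AUC as with the positive-class probabilities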

    Correcting this with the following code makes the scikit-learn results identical to those returned by Yellowbrick:

    y_pred = clf_lr.predict_proba(X_test)     # get probabilities
    y_prob = np.array([x[1] for x in y_pred]) # keep the prob for the positive class 1
    roc_auc = roc_auc_score(y_test, y_prob)
    avg_precision = average_precision_score(y_test, y_prob)
    print(f"ROC_AUC: {roc_auc}")
    print(f"Average_precision: {avg_precision}")
    

    Results:

    ROC_AUC: 0.9545954595459546
    Average_precision: 0.9541994473779806
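
    Equivalently, the positive-class column can be taken with NumPy slicing instead of the list comprehension above; this is just a more idiomatic variant, not part of the fix itself:

    y_prob = clf_lr.predict_proba(X_test)[:, 1]  # column 1 holds the probability of the positive class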
    

    Since Yellowbrick handles all these computational details internally (and transparently), it does not suffer from the mistake made here in the manual scikit-learn procedure.
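
    Roughly speaking, the curves the visualizers draw can be reproduced with plain scikit-learn from the same probability scores; a sketch of the equivalent manual computation, using the y_prob obtained above:

    from sklearn.metrics import roc_curve, precision_recall_curve

    # curve coordinates computed from probabilities, as the plots require
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    precision, recall, _ = precision_recall_curve(y_test, y_prob)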


    Notice that, in the binary case (as here), you can (and should) make your plots less cluttered with the binary=True argument:

    viz3 = ROCAUC(LogisticRegression(random_state=seed), binary=True) # similarly for the PrecisionRecall curve
    

    and that, contrary to what one might expect intuitively, for the binary case at least, the score method of ROCAUC will not return the AUC, but the accuracy, as specified in the docs:

    viz3.score(X_test, y_test)
    # 0.88
    
    # verify this is the accuracy:
    
    from sklearn.metrics import accuracy_score
    accuracy_score(y_test, clf_lr.predict(X_test))
    # 0.88