Tags: python, machine-learning, scikit-learn, cross-validation, roc

How to get roc auc for binary classification in sklearn


I have a binary classification problem where I want to calculate the roc_auc of the results. For this purpose, I did it in two different ways using sklearn. My code is as follows.

Code 1:

import numpy as np
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_validate

myscore = make_scorer(roc_auc_score, needs_proba=True)

my_value = cross_validate(clf, X, y, cv=10, scoring=myscore)
print(np.mean(my_value['test_score']))

I get the output as 0.60.

Code 2:

from sklearn.model_selection import cross_val_predict

y_score = cross_val_predict(clf, X, y, cv=k_fold, method="predict_proba")

from sklearn.metrics import roc_curve, auc
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(2):
    fpr[i], tpr[i], _ = roc_curve(y, y_score[:,i])
    roc_auc[i] = auc(fpr[i], tpr[i])
print(roc_auc)

I get the output as {0: 0.41, 1: 0.59}.

I am confused because I get two different scores from the two snippets. Please let me know why this difference happens and which is the correct way of doing this.

I am happy to provide more details if needed.


Solution

  • It seems that you used part of my code from another answer, so I thought I would also answer this question.

    For a binary classification case, you have 2 classes and one is the positive class.

    For example, see the roc_curve documentation: pos_label is the label of the positive class. When pos_label=None, if y_true is in {-1, 1} or {0, 1}, pos_label is set to 1; otherwise an error will be raised.
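As a minimal sketch of that default (using the small example arrays from the roc_curve docs): with labels in {0, 1}, class 1 is taken as positive, so the scores passed to roc_curve should be the probabilities of class 1:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1])
p_pos = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probabilities of class 1

# pos_label defaults to 1 because y_true is in {0, 1}
fpr, tpr, _ = roc_curve(y_true, p_pos)
print(auc(fpr, tpr))  # 0.75
```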

    import matplotlib.pyplot as plt
    from sklearn import svm, datasets
    from sklearn.metrics import roc_curve, auc
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.linear_model import LogisticRegression
    import numpy as np
    
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    mask = (y!=2)
    y = y[mask]
    X = X[mask,:]
    print(y)
    # [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    #  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
    
    positive_class = 1
    
    clf = OneVsRestClassifier(LogisticRegression())
    y_score = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
    
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    fpr[positive_class], tpr[positive_class], _ = roc_curve(y, y_score[:, positive_class])
    roc_auc[positive_class] = auc(fpr[positive_class], tpr[positive_class])
    print(roc_auc)
    
    # {1: 1.0}
    

    and the same score from cross_validate:

    from sklearn.metrics import make_scorer
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import cross_validate
    
    myscore = make_scorer(roc_auc_score, needs_proba=True)
    
    clf = OneVsRestClassifier(LogisticRegression())
    my_value = cross_validate(clf, X, y, cv=10, scoring=myscore)
    print(np.mean(my_value['test_score']))
    # 1.0
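Finally, a hedged sketch (on synthetic scores, not the asker's data) of why the question's Code 2 printed {0: 0.41, 1: 0.59}: the two columns of predict_proba sum to 1, so scoring the class-0 column yields exactly 1 minus the AUC of the class-1 column. Only the positive-class column is the AUC you want to report.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# synthetic labels and class-1 probabilities (stand-ins for predict_proba[:, 1])
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=100)
p1 = rng.rand(100)
p0 = 1.0 - p1  # what predict_proba[:, 0] would contain

auc1 = roc_auc_score(y, p1)
auc0 = roc_auc_score(y, p0)
# the two AUCs are complementary, mirroring the asker's {0: 0.41, 1: 0.59}
assert np.isclose(auc0 + auc1, 1.0)
```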