Tags: python, scikit-learn, logistic-regression, confusion-matrix

What constitutes false positives / how to calculate FP rate


I'm following this tutorial https://youtu.be/0HDy6n3UD5M?t=1320 where the author says he is calculating the false positives, but he gets a NumPy array containing what I understand to be both the false negatives and the false positives.

E.g. the confusion matrix is:

from sklearn.metrics import confusion_matrix
import numpy as np

cm = confusion_matrix(y_train, y_pred, labels=[1, 0])
# array([[250,  83],
#        [ 76, 311]])

and he computes the false positives as

FP = cm.sum(axis=0) - np.diag(cm)
# array([76, 83])

Shouldn't the false positives just be 83? I read in another article that he might be calculating "potential false positives", but what does that mean? This looks like the FP and FN counts side by side.

The rest of the code is:

FN = cm.sum(axis=1) - np.diag(cm)   # row sums minus diagonal: false negatives per class
TP = np.diag(cm)                    # diagonal: true positives per class
TN = cm.sum() - (FP + FN + TP)      # everything else: true negatives per class
TPR = TP / (TP + FN)                # recall (true positive rate) per class

Solution

  • It looks like the tutorial is computing metrics in a class-dependent (one-vs-rest) way.

    Normally we think of "false positives" as a single number corresponding to an entry in the confusion matrix:

    from sklearn.metrics import confusion_matrix
    
    y_true = [0, 0, 0, 1, 1, 1]
    y_pred = [0, 0, 1, 0, 0, 1]
    
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"Number of false positives: {fp}")
    # Number of false positives: 1
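
    One caveat: that unpacking assumes the default label order [0, 1]. With labels=[1, 0], as in the question, the rows and columns are reversed, so the raveled entries come out in the opposite order — a minimal sketch using the same y_true and y_pred:

    # labels=[1, 0] puts class 1 first, so ravel() yields tp, fn, fp, tn
    tp, fn, fp, tn = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()
    print(f"Number of false positives: {fp}")
    # Number of false positives: 1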
    

    But we can also frame the false positives in a class-dependent way. We can compute a one-vs-rest confusion matrix for each class, giving a (C, 2, 2) array where C is the number of classes and each 2×2 slice is laid out as [[TN, FP], [FN, TP]]:

    from sklearn.metrics import multilabel_confusion_matrix

    mcm = multilabel_confusion_matrix(y_true, y_pred)
    print(mcm)
    # [[[1 2]
    #   [1 2]]
    #
    #  [[2 1]
    #   [2 1]]]
    

    This means we have a vector of true positives and a vector of false positives, with one entry per class:

    tps = mcm[:, 1, 1]
    # [2 1]
    
    fps = mcm[:, 0, 1]
    # [2 1]
    

    This allows us to compute metrics like per-class precision:

    print(f"Class-dependent precision: {tps / (tps + fps)}")
    # Class-dependent precision: [0.5 0.5]
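
    Similarly, the false negatives for each class sit at mcm[:, 1, 0], which gives the per-class recall — the TPR from the question's code:

    fns = mcm[:, 1, 0]
    # [1 2]

    print(f"Class-dependent recall: {tps / (tps + fns)}")
    # Class-dependent recall: [0.66666667 0.33333333]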
    

    This is also how you arrive at the numbers in classification_report(y_true, y_pred):
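
    For reference, printing the report directly:

    from sklearn.metrics import classification_report
    print(classification_report(y_true, y_pred))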

                  precision    recall  f1-score   support
    
               0       0.50      0.67      0.57         3
               1       0.50      0.33      0.40         3
    
        accuracy                           0.50         6
       macro avg       0.50      0.50      0.49         6
    weighted avg       0.50      0.50      0.49         6
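
    Applying the same class-dependent framing to the matrix from the question shows what the [76, 83] vector means. A small sketch, reusing the cm from the question (built with labels=[1, 0]):

    import numpy as np

    cm = np.array([[250,  83],
                   [ 76, 311]])

    # column sums minus the diagonal: false positives per class
    FP = cm.sum(axis=0) - np.diag(cm)
    # array([76, 83])

    So 76 is the false-positive count for class 1 (samples predicted as 1 that are really 0), and 83 is the false-positive count for class 0. In the binary case, each class's false positives are exactly the other class's false negatives, which is why the vector looks like the FP and FN of a single class side by side. If class 1 is your positive class, the single "false positives" number is 76; 83 is the false negatives for class 1.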