python scikit-learn logistic-regression

In a logistic regression model using scikit-learn, how does the confusion matrix know which is the positive class?


I am performing a cancer prediction task (where 1 is a cancer case and 0 is a control). The tutorials I've watched never seem to tell the Logistic Regression model which class is the positive one before producing the confusion matrix.

By default, do the true positives count the correctly predicted 1s and the true negatives the correctly predicted 0s?


Solution

  • In sklearn.metrics.confusion_matrix we have a parameter called labels with default value None. The documentation of labels tells us:

    List of labels to index the matrix. This may be used to reorder or select a subset of labels. If None is given, those that appear at least once in y_true or y_pred are used in sorted order.

    So, to control which index each class gets, pass the classes to labels in the order you want. For example, with positive = 1 and negative = 0:

    >>> from sklearn.metrics import confusion_matrix as cm
    >>> y_test = [1, 0, 0]
    >>> y_pred = [1, 0, 0]
    >>> cm(y_test, y_pred, labels=[1,0])
    array([[1, 0],
           [0, 2]])
    
                  Pred
                 |  pos=1 | neg=0 |
             ___________________
    Actual  pos=1|  TP=1  | FN=0 |
            neg=0|  FP=0  | TN=2 |
    

    Note: Passing labels as [1,0] puts TP, TN, FP and FN in different cells than the default ordering does. TP means both the predicted and the actual value are positive. TN means both the predicted and the actual value are negative. FP means the predicted value is positive but the actual value is negative, and FN is the reverse.

    If we don't pass any value to labels, the values that appear in y_true and y_pred are used in sorted order, i.e. [0,1].

    >>> y_test = [1, 0, 0]
    >>> y_pred = [1, 0, 0]
    >>> cm(y_test, y_pred)
    array([[2, 0],
           [0, 1]])
    
                  Pred
                 |  neg=0 | pos=1 |
             ___________________
    Actual  neg=0|  TN=2  | FP=0 |
            pos=1|  FN=0  | TP=1 |
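
With the default label order [0, 1], the four cells can also be unpacked by name, which avoids having to remember the layout. A minimal sketch, reusing the y_test and y_pred values from the example above:

```python
from sklearn.metrics import confusion_matrix

y_test = [1, 0, 0]
y_pred = [1, 0, 0]

# Default label order is [0, 1], so flattening the 2x2 matrix row by row
# yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 0 0 1
```

This unpacking order only holds for binary labels in the default sorted order. If labels=[1,0] is passed, the flattened order becomes TP, FN, FP, TN instead.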
    

    This becomes even clearer with more than two labels. Say Cat=1, Dog=2, Mouse=3. If you want the order to be Cat, Mouse, Dog, then pass labels=[1,3,2]:

    >>> y_test = [1, 2, 3]
    >>> y_pred = [1, 3, 2]
    >>> cm(y_test, y_pred, labels=[1,3,2])
    array([[1, 0, 0],
           [0, 0, 1],
           [0, 1, 0]])
    
                    Pred
              |  1  |  3  |  2 |
              __________________
    Actual  1 |   1 |  0  |  0 |
            3 |   0 |  0  |  1 |
            2 |   0 |  1  |  0 |
    

    If you want some other order, such as Dog, Mouse, Cat, then pass labels=[2,3,1]:

    >>> cm(y_test, y_pred, labels=[2,3,1])
    array([[0, 1, 0],
           [1, 0, 0],
           [0, 0, 1]])
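
To tie this back to the original question: the fitted model itself has no notion of a "positive" class. It only stores the sorted labels it saw in its classes_ attribute, and the confusion matrix is ordered independently by labels. The sketch below uses made-up toy data (the X and y values are assumptions for illustration, not from the question) to show how the two can be kept in the same order:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical toy data: one feature, 0 = control, 1 = cancer case.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# classes_ holds the labels seen during fit, in sorted order.
print(model.classes_)  # [0 1]

# Column i of predict_proba corresponds to model.classes_[i],
# so column 1 is the estimated probability of label 1.
proba_cancer = model.predict_proba(X)[:, 1]

y_pred = model.predict(X)

# Passing model.classes_ as labels keeps the matrix rows and columns
# in the same order the model uses internally.
matrix = confusion_matrix(y, y_pred, labels=model.classes_)
print(matrix)
```

So with 0/1 labels and no labels argument, row and column 1 of the matrix refer to label 1, which matches the usual convention of treating 1 as the positive class.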