Tags: python, scikit-learn, precision-recall

Sklearn precision_recall_curve pos_label for unbalanced dataset: which class probability to use?



I want to evaluate my model using precision and recall because my data is unbalanced. Since I have a binary classification problem, I use a softmax at the end of my NN. The output scores and true labels look something like:

y_score = [[0.4, 0.6],
           [0.6, 0.4],
           [0.3, 0.7],
              ...   ]
y_true = [1,
          0,
          0,
         ...]

Where y_score[:, 0] corresponds to the probability of class 0.
My positive label is 0, and thus the negative label is 1 in my case.

Since my dataset is unbalanced (more negatives than positives), I want to use the area under the precision-recall curve (AUPRC) to evaluate my classifier. The function sklearn.metrics.precision_recall_curve takes a parameter pos_label, which I would set to pos_label=0. But the parameter probas_pred expects an ndarray of probabilities of shape (n_samples,).

My question is: which of my y_score columns should I pass as probas_pred, given that I set pos_label=0?

I hope my question is clear.
Thank you in advance!


Solution

  • It should be the first column in the example above; here's how you can check to be sure.

    Using an example dataset:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.datasets import make_blobs
    from sklearn.metrics import precision_recall_curve
    
    X, y = make_blobs(n_samples=[400, 2000], centers=None, n_features=5,
                      random_state=999, cluster_std=5)
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=111)
    

    Train the classifier:

    clf = MLPClassifier(hidden_layer_sizes=(3, 3), random_state=999)
    clf.fit(X_train, y_train)
    

    Check the classes:

    clf.classes_
    array([0, 1])
    
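    The columns of predict_proba follow the order of classes_, which is sorted. Rather than hard-coding column 0, you can look up the index for a given pos_label programmatically; a minimal sketch (the `classes` array here stands in for `clf.classes_`):

    ```python
    import numpy as np

    # predict_proba columns follow clf.classes_ (sorted label order),
    # so the column for the positive label can be looked up directly.
    classes = np.array([0, 1])   # stands in for clf.classes_
    pos_label = 0
    pos_col = int(np.flatnonzero(classes == pos_label)[0])
    print(pos_col)  # index of the column to pass as probas_pred
    ```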

    You can put the predicted probabilities and the true labels together in a dataframe to confirm the column order is correct:

        0   1   actual
    0   0.999734    0.000266    0
    1   0.001253    0.998747    1
    2   0.000137    0.999863    1
    3   0.000113    0.999887    1
    4   0.003173    0.996827    1
    ... ... ... ...
    475 0.014316    0.985684    1
    476 0.012767    0.987233    1
    477 0.062735    0.937265    1
    478 0.000048    0.999952    1
    479 0.999733    0.000267    0
    
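    The table above can be built with something like the following; this is a self-contained sketch that rebuilds the toy dataset and model from the snippets above:

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.datasets import make_blobs

    # Rebuild the example dataset and classifier from above
    X, y = make_blobs(n_samples=[400, 2000], centers=None, n_features=5,
                      random_state=999, cluster_std=5)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=111)
    clf = MLPClassifier(hidden_layer_sizes=(3, 3),
                        random_state=999).fit(X_train, y_train)

    # One probability column per entry of clf.classes_, plus the true label
    df = pd.DataFrame(clf.predict_proba(X_test), columns=clf.classes_)
    df["actual"] = y_test
    print(df.head())
    ```

    Rows where "actual" is 0 should have a high value in column 0, confirming that column 0 holds the probability of class 0.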

    Then calculate it:

    prec, recall, thres = precision_recall_curve(
        y_true=y_test, probas_pred=clf.predict_proba(X_test)[:, 0], pos_label=0)
    

    And plot it. If you had flipped the columns, this curve would look very odd; with the correct column it looks as expected:

    import matplotlib.pyplot as plt

    plt.plot(prec, recall)
    plt.xlabel("precision")
    plt.ylabel("recall")
    plt.show()
    

    [plot of the precision-recall curve]
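    Since the question asks about AUPRC specifically: once you have the curve, you can reduce it to a single number with sklearn.metrics.auc, or use average_precision_score (a closely related step-wise summary). A self-contained sketch reusing the toy setup above:

    ```python
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import (precision_recall_curve, auc,
                                 average_precision_score)

    # Rebuild the example dataset and classifier from above
    X, y = make_blobs(n_samples=[400, 2000], centers=None, n_features=5,
                      random_state=999, cluster_std=5)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=111)
    clf = MLPClassifier(hidden_layer_sizes=(3, 3),
                        random_state=999).fit(X_train, y_train)

    # Probability of the positive class (label 0) -> column 0
    probs = clf.predict_proba(X_test)[:, 0]
    prec, recall, _ = precision_recall_curve(y_test, probs, pos_label=0)

    # auc integrates the curve (x must be the recall values);
    # average_precision_score is the step-wise alternative.
    auprc = auc(recall, prec)
    ap = average_precision_score(y_test, probs, pos_label=0)
    print(auprc, ap)
    ```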