I want to evaluate my model using precision-recall scores because my data is unbalanced. Since this is a binary classification problem, I am using a softmax at the end of my NN.
The output scores and true labels look something like this:
y_score = [[0.4, 0.6],
[0.6, 0.4],
[0.3, 0.7],
... ]
y_true = [1,
0,
0,
...]
Here y_score[:,0] corresponds to the probability of class 0.
My positive label is 0, so the negative label is 1 in my case.
Since my dataset is unbalanced (more negatives than positives), I want to use the area under the precision-recall curve (AUPRC) to evaluate my classifier. The function sklearn.metrics.precision_recall_curve takes a parameter pos_label, which I would set to pos_label = 0. But the parameter probas_pred takes an ndarray of probabilities of shape (n_samples,).
My question is: which column of y_score should I pass as probas_pred, given that I set pos_label = 0?
I hope my question is clear.
Thank you in advance!
It should be the first column in the example above, because probas_pred has to hold the scores for whichever class you pass as pos_label. Here's how you can check to be sure.
Using an example dataset:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_blobs
from sklearn.metrics import precision_recall_curve
X, y = make_blobs(n_samples=[400,2000], centers=None,n_features=5,random_state=999,cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=111)
Train the classifier:
clf = MLPClassifier(hidden_layer_sizes=(3, 3), random_state=999)
clf.fit(X_train, y_train)
Check the classes:
clf.classes_
array([0, 1])
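If you would rather not hard-code the column, here is a minimal sketch of looking it up from the class order. The helper name positive_column is made up for illustration, and the arrays are just the values from the question; with your own softmax output the column order is whatever your label encoding defines, so treat this as an assumption to verify.

```python
import numpy as np

def positive_column(classes, pos_label):
    """Return the index of the probability column that belongs to pos_label."""
    matches = np.flatnonzero(np.asarray(classes) == pos_label)
    if matches.size != 1:
        raise ValueError(f"pos_label {pos_label!r} not found exactly once in {classes!r}")
    return int(matches[0])

classes = np.array([0, 1])  # same order as clf.classes_ above
y_score = np.array([[0.4, 0.6],
                    [0.6, 0.4],
                    [0.3, 0.7]])  # toy scores copied from the question

col = positive_column(classes, pos_label=0)
scores_for_pos = y_score[:, col]  # 1-D array of shape (n_samples,)
print(col, scores_for_pos)
```

The resulting 1-D array is what goes in as the score argument of precision_recall_curve, together with the same pos_label.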
You can put the predicted probabilities and the actual labels together in a dataframe to see that the columns line up:
            0         1  actual
0    0.999734  0.000266       0
1    0.001253  0.998747       1
2    0.000137  0.999863       1
3    0.000113  0.999887       1
4    0.003173  0.996827       1
..        ...       ...     ...
475  0.014316  0.985684       1
476  0.012767  0.987233       1
477  0.062735  0.937265       1
478  0.000048  0.999952       1
479  0.999733  0.000267       0
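The code that builds this table is not shown, so the following is only a sketch of one way it could be done, reusing the same dataset and classifier settings as above. The variable name proba is my own choice, and the exact numbers you get may differ from the table depending on your scikit-learn version.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Same unbalanced toy data and split as above
X, y = make_blobs(n_samples=[400, 2000], centers=None, n_features=5,
                  random_state=999, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111)

clf = MLPClassifier(hidden_layer_sizes=(3, 3), random_state=999)
clf.fit(X_train, y_train)

# One probability column per class, named after clf.classes_,
# plus the true label for a side-by-side comparison
proba = pd.DataFrame(clf.predict_proba(X_test), columns=clf.classes_)
proba["actual"] = y_test
print(proba)
```

Naming the columns after clf.classes_ is the point of the check: rows whose actual label is 0 should have their high probability in the column named 0.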
Then calculate it:
# Scores are passed positionally: newer scikit-learn releases renamed the probas_pred keyword to y_score
prec, recall, thres = precision_recall_curve(y_test, clf.predict_proba(X_test)[:, 0], pos_label=0)
And plot it. If you had flipped your values, the curve would look really weird, but below it's correct:
plt.plot(recall, prec)  # recall on the x-axis, precision on the y-axis, as is conventional
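Since the question asks about AUPRC as a single number, here is a small sketch of how it could be computed and how much picking the wrong column matters. The toy labels and scores are invented for illustration (class 0 is the positive class, p0 is the column for class 0); average_precision_score and auc are two common ways to summarise the curve and will generally give slightly different values.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Invented toy data: label 0 is the (rarer) positive class
y_true = np.array([1, 0, 0, 1, 1, 1, 0, 1])
p0 = np.array([0.4, 0.6, 0.3, 0.2, 0.1, 0.5, 0.8, 0.45])  # P(class 0)
p1 = 1.0 - p0                                              # P(class 1)

# Correct pairing: scores for class 0 together with pos_label=0
prec, recall, _ = precision_recall_curve(y_true, p0, pos_label=0)
ap_right = average_precision_score(y_true, p0, pos_label=0)
auprc_trapezoid = auc(recall, prec)  # trapezoidal area under the same curve

# Flipped pairing: scores for class 1 while still claiming pos_label=0
ap_wrong = average_precision_score(y_true, p1, pos_label=0)

print(f"AP (column 0):       {ap_right:.3f}")
print(f"AUC of PR curve:     {auprc_trapezoid:.3f}")
print(f"AP (wrong column 1): {ap_wrong:.3f}")
```

If the flipped pairing scores clearly worse than the intended one, as it does here, that is a quick sanity check that the column and pos_label agree.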