I have a binary classification problem where I use the following code to get my weighted avarege precision
, weighted avarege recall
, weighted avarege f-measure
and roc_auc
.
df = pd.read_csv(input_path+input_file)
X = df[features]
y = df[["gold_standard"]]
clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=k_fold, scoring = ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'))
print("accuracy")
print(np.mean(scores['test_accuracy'].tolist()))
print("precision_weighted")
print(np.mean(scores['test_precision_weighted'].tolist()))
print("recall_weighted")
print(np.mean(scores['test_recall_weighted'].tolist()))
print("f1_weighted")
print(np.mean(scores['test_f1_weighted'].tolist()))
print("roc_auc")
print(np.mean(scores['test_roc_auc'].tolist()))
I got the following results for the same dataset with 2 different feature settings.
Feature setting 1 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6920, 0.6888, 0.6920, 0.6752, 0.7120
Feature setting 2 ('accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc'):
0.6806 0.6754 0.6806 0.6643 0.7233
So, we can see that in feature setting 1
we get good results for 'accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted' compared to feature setting 2
.
However, when it comes to 'roc_auc' feature setting 2
is better than feature setting 1
. I found this weird becuase every other metric was better with feature setting 1
.
On one hand, I suspect that this happens since I am using weighted
scores for precision, recall and f-measure
and not with roc_auc
. Is it possible to do weighted roc_auc
for binary classification in sklearn?
What is the real problem for this weird roc_auc
results?
It is not weird, because comparing all these other metrics with AUC is like comparing apples to oranges.
Here is a high-level description of the whole process:
p
in [0, 1]
.0/1
), we apply a threshold to these probabilities; if not set explicitly (like here), this threshold is implicitly taken to be 0.5, i.e. if p>0.5
then class=1
, else class=0
.0/1
, i.e after the threshold has been applied.So, it can certainly happen, and it can indeed lead to confusion among new practitioners.
The second part of my answer in this similar question might be helpful for more details. Quoting:
According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) use is just like any other the-higher-the-better metric, like accuracy, which may naturally lead to puzzles like the one you express yourself.