First of all, sorry for the long description, but I want everyone to understand my problem and what I am doing.
I am working on a detection model that predicts 14 different pathologies, and I have written an inference script that produces predictions for any new test image. The test set has more than 25k images; I have already computed their predictions and stored them in a DataFrame like the one below.
Each row of this DataFrame contains (a little info to understand my scenario):
image_name______00000003_000.png
label_____[[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0], [119.195767195767, 339.166137566138, 470.281481481481, 511.458201058202]], ['Cardiomegaly', 'Edema', 'Infiltration']]
Bounding_Box_____True/False
Atelectasis _____0.172639399766922
Cardiomegaly _____0.064461663365364
Consolidation _____0.436323910951614
Edema _____0.152604594826698
Effusion _____0.077432356774807
Emphysema _____0.569778263568878
Fibrosis _____0.333310723304749
Hernia _____0.219542726874351
Infiltration _____0.240452200174332
Mass _____0.291741400957108
Nodule _____0.076222963631153
Pleural_Thickening_____ 0.294208467006683
Pneumonia _____0.281939893960953
Pneumothorax _____0.386653006076813
What I want:
This could be done in two ways, e.g. by taking the rows for each class separately: first find all rows whose label contains Cardiomegaly, whether single-label or multi-label, then apply the operation below, or use whatever approach your expertise suggests to find the TP.
For an image whose ground truth is, say, ['Cardiomegaly', 'Edema', 'Infiltration'] and which has probabilities for the 14 pathologies, I want to find the True Positives, i.e. check whether these actual labels have the highest probability values. For example, if Cardiomegaly has the highest probability, create a new column and set it to True. I don't know how to handle the multi-label case: after checking the first label, what should I do for the 2nd label, and if its probability is also the highest, how should I record that?
My last attempt was made with the help of @tlentali (thanks for helping me out).
Here is what I did:
df = pd.read_csv('/home/ali/Desktop/CX/sample.csv')
df["best_score"] = df.drop(['file', 'set', 'label', 'bbx'], axis=1).idxmax(axis=1)
df['evaluation'] = df.apply(lambda x: x["best_score"] in x["label"], axis=1)
df.groupby('best_score')['evaluation'].mean()
which gives me something like:
best_score
Atelectasis 0.452465
Cardiomegaly 0.250000
Consolidation 0.123164
Edema 0.029520
Effusion 0.555459
Emphysema 0.068618
Fibrosis 0.066116
Hernia 0.032258
Infiltration 0.400000
Mass 0.177524
Nodule 0.604167
Pleural_Thickening 0.188482
Pneumonia 0.049133
Pneumothorax 0.108156
Name: evaluation, dtype: float64
This is not what I want, and it only handles a single label, not multiple labels. Please help me out, and sorry again for the long description; I just want everyone to understand what I need. Thank you.
From your DataFrame:
>>> import pandas as pd
>>> df
file set label bbx Atelectasis Cardiomegaly Consolidation Edema Effusion Emphysema Fibrosis Hernia Infiltration Mass Nodule Pleural_Thickening Pneumonia Pneumothorax
0 00000003_000.png Test [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']] False 0.145712 0.028958 0.205006 0.055228 0.115680 0.376638 0.349124 0.357694 0.122496 0.202218 0.075018 0.118994 0.195345 0.215577
1 00000003_001.png Test [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']] False 0.132639 0.046136 0.169713 0.092743 0.285383 0.614464 0.311035 0.344040 0.117032 0.447748 0.152327 0.094364 0.174125 0.316022
2 00000003_002.png Test [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']] False 0.233026 0.042541 0.227911 0.047988 0.116835 0.595102 0.330304 0.367272 0.117985 0.298624 0.109354 0.133473 0.185444 0.379627
3 00000003_003.png Test [[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.... False 0.298693 0.022646 0.237977 0.035348 0.143645 0.487804 0.384509 0.379062 0.083205 0.625744 0.102377 0.207353 0.184517 0.354402
4 00000003_004.png Test [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']] False 0.522152 0.052897 0.237475 0.082139 0.200029 0.473421 0.377468 0.336104 0.106339 0.488078 0.088047 0.146686 0.200919 0.313684
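First, we parse the column label (stored as a string in the CSV) in order to extract the classes we expect to predict; ast.literal_eval does this without executing arbitrary code, unlike a plain eval:
>>> import ast
>>> df['label'] = df['label'].apply(ast.literal_eval)
>>> df['class'] = df.label.apply(lambda x: x[1])
>>> df['class']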
0 [Hernia]
1 [Hernia]
2 [Hernia]
3 [Hernia, Infiltration]
4 [Hernia]
5 [Hernia]
6 [Hernia]
7 [Hernia]
8 [No Finding]
9 [Emphysema, Pneumothorax]
10 [Emphysema, Pneumothorax]
11 [Pleural_Thickening]
12 [Effusion, Emphysema, Infiltration, Pneumothorax]
13 [Emphysema, Infiltration, Pleural_Thickening, ...
14 [Effusion, Infiltration]
15 [Infiltration]
Name: class, dtype: object
Then, we explode the column class to get one expected class per row, like so:
>>> df = df.explode('class')
>>> df = df.reset_index(drop=True)
>>> df['class']
0 Hernia
1 Hernia
2 Hernia
3 Hernia
4 Infiltration
5 Hernia
6 Hernia
7 Hernia
8 Hernia
9 No Finding
10 Emphysema
11 Pneumothorax
12 Emphysema
13 Pneumothorax
14 Pleural_Thickening
15 Effusion
16 Emphysema
17 Infiltration
18 Pneumothorax
19 Emphysema
20 Infiltration
21 Pleural_Thickening
22 Pneumothorax
23 Effusion
24 Infiltration
25 Infiltration
Name: class, dtype: object
Then, we transform the data into dummy (one-hot) format:
>>> classes = ['Atelectasis',
... 'Cardiomegaly',
... 'Consolidation',
... 'Edema',
... 'Effusion',
... 'Emphysema',
... 'Fibrosis',
... 'Hernia',
... 'Infiltration',
... 'Mass',
... 'Nodule',
... 'Pleural_Thickening',
... 'Pneumonia',
... 'Pneumothorax',
... 'No Finding']
>>> s = df['class']
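>>> # Series/DataFrame.sum(level=0) was removed in pandas 2.0; group on the index level instead
>>> df_classes = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()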
>>> df_classes.head()
Effusion Emphysema Hernia Infiltration No Finding Pleural_Thickening Pneumothorax
0 0 0 1 0 0 0 0
1 0 0 1 0 0 0 0
2 0 0 1 0 0 0 0
3 0 0 1 0 0 0 0
4 0 0 0 1 0 0 0
As we are currently working on a toy dataset, we have to make some adjustments so that every desired class is available in dummy format:
>>> df_classes['Atelectasis'] = 0
>>> df_classes['Cardiomegaly'] = 0
>>> df_classes['Consolidation'] = 0
>>> df_classes['Edema'] = 0
>>> df_classes['Fibrosis'] = 0
>>> df_classes['Mass'] = 0
>>> df_classes['Nodule'] = 0
>>> df_classes['Pneumonia'] = 0
>>> # df has no score column for 'No Finding', so give it a constant score of 0
>>> df['No Finding'] = 0
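As an alternative to adding the missing columns by hand, the indicator matrix can also be built in one step with scikit-learn's MultiLabelBinarizer, which accepts the full class list up front. This is only a sketch: the `labels` Series below is a hypothetical stand-in for the un-exploded `class` column, and the result has one row per image rather than one row per (image, class) pair, so it should be paired with the un-exploded score rows.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

classes = ['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 'Effusion',
           'Emphysema', 'Fibrosis', 'Hernia', 'Infiltration', 'Mass', 'Nodule',
           'Pleural_Thickening', 'Pneumonia', 'Pneumothorax', 'No Finding']

# Hypothetical stand-in for the un-exploded label lists (one list per image).
labels = pd.Series([['Hernia'], ['Hernia', 'Infiltration'], ['No Finding']])

# Passing classes= fixes the column order and keeps classes absent from the data.
mlb = MultiLabelBinarizer(classes=classes)
indicator = pd.DataFrame(
    mlb.fit_transform(labels.tolist()),
    columns=classes,
    index=labels.index,
)
print(indicator[['Hernia', 'Infiltration', 'No Finding']])
```

Classes that never occur (for example Atelectasis in the toy data) simply come out as all-zero columns, so no manual patching is needed.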
Now, we can use sklearn to get our TPR and eventually the AUC:
from sklearn.metrics import roc_curve, auc
n_classes = len(classes)
y_test = df_classes[classes].to_numpy()
y_score = df[classes].to_numpy()
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
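The ROC curve sweeps over every possible threshold. If a single operating point is wanted instead, the true positives per class can be counted directly from the indicator and score matrices. The sketch below uses small made-up arrays in place of y_test and y_score, and the 0.5 threshold is an assumption, not a value taken from the question.

```python
import numpy as np

# Hypothetical toy arrays: 4 images, 3 classes (stand-ins for y_test / y_score).
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.1],
                   [0.4, 0.7, 0.3],
                   [0.1, 0.6, 0.2],
                   [0.2, 0.1, 0.3]])

threshold = 0.5  # assumed operating point
y_pred = y_prob >= threshold

# Per-class counts: a true positive is a predicted label that is also a true label.
tp = np.logical_and(y_pred, y_true == 1).sum(axis=0)
fn = np.logical_and(~y_pred, y_true == 1).sum(axis=0)

# True positive rate (recall) per class at this threshold.
tpr_at_threshold = tp / (tp + fn)
print(tp, fn, tpr_at_threshold)
```

Because each class is scored independently, an image with several ground-truth labels can contribute a true positive to each of them; a class with no positive sample would need a guard against division by zero.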
Now, we can have a look at the roc_auc values; the nan entries are due to the fact that not all classes occur in the toy dataset:
>>> roc_auc
{0: nan,
1: nan,
2: nan,
3: nan,
4: 0.3125,
5: 0.7613636363636364,
6: nan,
7: 0.9479166666666666,
8: 0.6190476190476191,
9: nan,
10: nan,
11: 0.30208333333333337,
12: nan,
13: 0.7840909090909091,
14: 0.5,
'micro': 0.66562764158918}
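To keep undefined entries out of the result, one option is to skip any class whose ground-truth column contains only one value before computing its AUC. A minimal sketch, with made-up arrays and class names standing in for y_test, y_score and classes:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: 'Mass' has no positive sample, so its AUC is undefined.
names = ['Hernia', 'Mass']
y_true = np.array([[1, 0], [0, 0], [1, 0], [0, 0]])
y_score = np.array([[0.8, 0.1], [0.3, 0.4], [0.6, 0.2], [0.7, 0.3]])

aucs = {}
for i, name in enumerate(names):
    # AUC needs at least one positive and one negative sample for the class.
    if np.unique(y_true[:, i]).size < 2:
        continue
    aucs[name] = roc_auc_score(y_true[:, i], y_score[:, i])

print(aucs)
```

On the full 25k-image test set most classes should have both positives and negatives, so this guard mainly matters for small samples like the toy dataset here.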
We can now plot the ROC curve based on TPR and FPR for each class (the class index is called classe here; some classes are empty as we work on a toy dataset):
import matplotlib.pyplot as plt
plt.figure()
lw = 2
classe = 7
plt.plot(fpr[classe], tpr[classe], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[classe])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()
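For the per-image check described in the question (are the ground-truth labels among the highest-scoring classes?), one possible reading is: for an image with k true labels, look at its k highest probabilities and mark each true label as hit or missed. The sketch below follows that interpretation; the shortened class list, the toy DataFrame and the `hits` column name are all hypothetical, and `class` is assumed to hold the un-exploded list of labels.

```python
import pandas as pd

classes = ['Cardiomegaly', 'Edema', 'Infiltration', 'Mass']  # shortened, hypothetical

# Toy frame: one row per image, 'class' holds the list of true labels.
df = pd.DataFrame({
    'class': [['Cardiomegaly', 'Edema'], ['Mass']],
    'Cardiomegaly': [0.70, 0.10],
    'Edema': [0.20, 0.60],
    'Infiltration': [0.50, 0.20],
    'Mass': [0.10, 0.30],
})

def labels_in_top_k(row):
    """Map each true label to True if it is among the k highest scores."""
    true_labels = row['class']
    k = len(true_labels)
    top_k = row[classes].astype(float).nlargest(k).index
    return {label: label in top_k for label in true_labels}

# One dict per image, e.g. {'Cardiomegaly': True, 'Edema': False}.
df['hits'] = [labels_in_top_k(row) for _, row in df.iterrows()]
print(df[['class', 'hits']])
```

This is a ranking-based notion of a true positive; it is stricter than thresholding each class independently and is not what the ROC analysis above measures.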