python pandas dataframe model prediction

How to find only the True Positives in a DataFrame when the ground truth is available?

First of all, sorry for the long description, but I want everyone to understand my problem and what I am doing.

I am working on a detection model that predicts 14 different pathologies, and I have written an inference script that makes predictions for new test images. The test set has more than 25k images; I have already computed their predictions and saved them to a file that I load as a DataFrame.

Each row of this DataFrame contains (a little info to understand my scenario):

image_name______00000003_000.png
label_____[[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0], [119.195767195767, 339.166137566138, 470.281481481481, 511.458201058202]], ['Cardiomegaly', 'Edema', 'Infiltration']]
Bounding_Box_____True/False
Atelectasis _____0.172639399766922
Cardiomegaly _____0.064461663365364
Consolidation _____0.436323910951614
Edema _____0.152604594826698
Effusion _____0.077432356774807
Emphysema _____0.569778263568878
Fibrosis _____0.333310723304749
Hernia _____0.219542726874351
Infiltration _____0.240452200174332
Mass _____0.291741400957108
Nodule _____0.076222963631153
Pleural_Thickening_____ 0.294208467006683
Pneumonia _____0.281939893960953
Pneumothorax _____0.386653006076813

What I want: I see two ways to do this. One is to handle each class separately, e.g. first select all rows whose label contains Cardiomegaly, whether single-label or multi-label.

Then apply the operation below, or use whatever approach you think is best to find the TP.

For an image whose ground truth is e.g. ['Cardiomegaly', 'Edema', 'Infiltration'] and which has probabilities for the 14 pathologies, I want to count a True Positive when those actual labels have the highest probability values.

For example, if Cardiomegaly has the highest probability, create a new column and set it to True. I don't know how to handle the multi-label case: after checking the first label, what should I do for the second label, and how should I handle it if its probability is the highest? My latest attempt was made with the help of @tlentali (thanks for helping me out). Here is what I have done:

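import pandas as pd

df = pd.read_csv('/home/ali/Desktop/CX/sample.csv')
# Name of the class with the highest predicted probability in each row
df["best_score"] = df.drop(['file', 'set', 'label', 'bbx'], axis=1).idxmax(axis=1)
# True when that class name appears in the ground-truth label
df['evaluation'] = df.apply(lambda x: x["best_score"] in x["label"], axis=1)
df.groupby('best_score')['evaluation'].mean()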

which gives me something like:

best_score
Atelectasis           0.452465
Cardiomegaly          0.250000
Consolidation         0.123164
Edema                 0.029520
Effusion              0.555459
Emphysema             0.068618
Fibrosis              0.066116
Hernia                0.032258
Infiltration          0.400000
Mass                  0.177524
Nodule                0.604167
Pleural_Thickening    0.188482
Pneumonia             0.049133
Pneumothorax          0.108156
Name: evaluation, dtype: float64

This is not what I want: it only covers the single-label case, not the multi-label one. Please help me out, and sorry again for the long description; I only wanted everyone to understand what I need. Thank you.

Solution

Starting from your DataFrame:

>>> import pandas as pd

>>> df
                file    set     label                                        bbx    Atelectasis Cardiomegaly    Consolidation   Edema   Effusion    Emphysema   Fibrosis    Hernia  Infiltration    Mass    Nodule  Pleural_Thickening  Pneumonia   Pneumothorax
0   00000003_000.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.145712    0.028958    0.205006    0.055228    0.115680    0.376638    0.349124    0.357694    0.122496    0.202218    0.075018    0.118994    0.195345    0.215577
1   00000003_001.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.132639    0.046136    0.169713    0.092743    0.285383    0.614464    0.311035    0.344040    0.117032    0.447748    0.152327    0.094364    0.174125    0.316022
2   00000003_002.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.233026    0.042541    0.227911    0.047988    0.116835    0.595102    0.330304    0.367272    0.117985    0.298624    0.109354    0.133473    0.185444    0.379627
3   00000003_003.png    Test    [[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024....   False   0.298693    0.022646    0.237977    0.035348    0.143645    0.487804    0.384509    0.379062    0.083205    0.625744    0.102377    0.207353    0.184517    0.354402
4   00000003_004.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.522152    0.052897    0.237475    0.082139    0.200029    0.473421    0.377468    0.336104    0.106339    0.488078    0.088047    0.146686    0.200919    0.313684

First, we evaluate the label column (stored as a string) in order to extract the classes we expect to predict:

>>> df['label'] = df['label'].apply(eval)
>>> df['class'] = df.label.apply(lambda x: x[1])
>>> df['class']
0                                              [Hernia]
1                                              [Hernia]
2                                              [Hernia]
3                                [Hernia, Infiltration]
4                                              [Hernia]
5                                              [Hernia]
6                                              [Hernia]
7                                              [Hernia]
8                                          [No Finding]
9                             [Emphysema, Pneumothorax]
10                            [Emphysema, Pneumothorax]
11                                 [Pleural_Thickening]
12    [Effusion, Emphysema, Infiltration, Pneumothorax]
13    [Emphysema, Infiltration, Pleural_Thickening, ...
14                             [Effusion, Infiltration]
15                                       [Infiltration]
Name: class, dtype: object
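
As a side note, eval will execute whatever Python expression is stored in the CSV. A minimal sketch of a safer variant, assuming the label strings contain only plain Python literals (nested lists, floats, strings), is to parse them with ast.literal_eval. The two-row frame below is made up for illustration and is not taken from the real file:

```python
import ast

import pandas as pd

# Hypothetical two-row frame mimicking the 'label' column as it is stored in the CSV (plain text).
toy = pd.DataFrame({
    "label": [
        "[[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]",
        "[[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0]], ['Hernia', 'Infiltration']]",
    ]
})

# literal_eval only accepts Python literals and raises on anything else.
toy["label"] = toy["label"].apply(ast.literal_eval)

# The second element of each parsed label is the list of class names.
toy["class"] = toy["label"].apply(lambda x: x[1])
parsed_classes = toy["class"].tolist()
```

The rest of the steps work unchanged on a column parsed this way.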

Then, we explode the class column so that each row holds a single expected class:

>>> df = df.explode('class')
>>> df = df.reset_index(drop=True)
>>> df['class']
0                 Hernia
1                 Hernia
2                 Hernia
3                 Hernia
4           Infiltration
5                 Hernia
6                 Hernia
7                 Hernia
8                 Hernia
9             No Finding
10             Emphysema
11          Pneumothorax
12             Emphysema
13          Pneumothorax
14    Pleural_Thickening
15              Effusion
16             Emphysema
17          Infiltration
18          Pneumothorax
19             Emphysema
20          Infiltration
21    Pleural_Thickening
22          Pneumothorax
23              Effusion
24          Infiltration
25          Infiltration
Name: class, dtype: object
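
To make the effect of explode concrete, here is a small sketch on a made-up frame (the file names and scores are hypothetical). Each multi-label image becomes several rows, and its score columns are repeated on every one of them:

```python
import pandas as pd

# Hypothetical frame: one row per image, 'class' holds a list of expected classes.
toy = pd.DataFrame({
    "file": ["a.png", "b.png"],
    "class": [["Hernia"], ["Emphysema", "Pneumothorax"]],
    "Hernia": [0.36, 0.10],
})

# One row per (image, expected class); the other columns are duplicated.
exploded = toy.explode("class").reset_index(drop=True)

# Number of rows each image contributes after exploding.
rows_per_file = exploded.groupby("file").size().to_dict()
```

Keeping the file column around makes it possible to group the exploded rows back by image later.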

Then, we transform the data into dummy (one-hot) format:

>>> classes = ['Atelectasis', 
...            'Cardiomegaly',
...            'Consolidation', 
...            'Edema', 
...            'Effusion', 
...            'Emphysema', 
...            'Fibrosis', 
...            'Hernia',
...            'Infiltration', 
...            'Mass', 
...            'Nodule', 
...            'Pleural_Thickening', 
...            'Pneumonia',
...            'Pneumothorax',
...            'No Finding']
>>> s = df['class']
>>> df_classes = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
>>> df_classes.head()
    Effusion    Emphysema   Hernia  Infiltration    No Finding  Pleural_Thickening  Pneumothorax
0   0           0           1       0               0           0                   0
1   0           0           1       0               0           0                   0
2   0           0           1       0               0           0                   0
3   0           0           1       0               0           0                   0
4   0           0           0       1               0           0                   0

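As we are working on a toy dataset, some classes never occur in it, so we have to add their dummy columns manually in order to cover all the desired classes (the model outputs no score for 'No Finding', so that score column is set to 0 as well):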

>>> df_classes['Atelectasis'] = 0 
>>> df_classes['Cardiomegaly'] = 0 
>>> df_classes['Consolidation'] = 0 
>>> df_classes['Edema'] = 0 
>>> df_classes['Fibrosis'] = 0 
>>> df_classes['Mass'] = 0 
>>> df_classes['Nodule'] = 0 
>>> df_classes['Pneumonia'] = 0 
>>> df['No Finding'] = 0
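
If you would rather keep one row per image, a possible alternative (a sketch, not the approach used in this answer) is to build a multi-hot ground-truth matrix directly from the list column and let reindex add the classes that never occur. The class list and frame below are shortened, hypothetical examples:

```python
import pandas as pd

# Hypothetical, shortened class list; 'Mass' never occurs in the toy labels.
classes = ["Emphysema", "Hernia", "Pneumothorax", "Mass"]

# One row per image, 'class' holds the list of expected classes.
toy = pd.DataFrame({"class": [["Hernia"], ["Emphysema", "Pneumothorax"]]})

multi_hot = (
    pd.get_dummies(toy["class"].explode())   # one dummy row per (image, class)
    .groupby(level=0).sum()                  # collapse back to one row per image
    .reindex(columns=classes, fill_value=0)  # add missing classes as all-zero columns
    .astype(int)
)
```

With this layout a multi-label image has several 1s on the same row, so every expected class counts as positive for that image.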

Now, we can use sklearn to get our TPR and eventually the AUC:

from sklearn.metrics import roc_curve, auc


n_classes = len(classes)
y_test = df_classes[classes].to_numpy()
y_score = df[classes].to_numpy()

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

Now, we can have a look at the roc_auc values; the nan values arise because not all classes occur in the ground truth of the toy dataset:

>>> roc_auc
{0: nan,
 1: nan,
 2: nan,
 3: nan,
 4: 0.3125,
 5: 0.7613636363636364,
 6: nan,
 7: 0.9479166666666666,
 8: 0.6190476190476191,
 9: nan,
 10: nan,
 11: 0.30208333333333337,
 12: nan,
 13: 0.7840909090909091,
 14: 0.5,
 'micro': 0.66562764158918}
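
roc_curve reports rates over all possible thresholds. If you want raw True Positive counts per class at one fixed threshold, a minimal sketch could look like the following. The threshold of 0.5 and the small arrays are assumptions chosen for illustration; in practice y_true and y_score would be the y_test and y_score arrays built above:

```python
import numpy as np

# Hypothetical ground truth (rows = samples, columns = classes) and predicted scores.
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.6],
                    [0.4, 0.7, 0.3],
                    [0.3, 0.8, 0.1]])

threshold = 0.5  # assumed cut-off, to be tuned per class in practice
y_pred = (y_score >= threshold).astype(int)

# Per-class counts: predicted positive and actually positive / missed positives.
tp = ((y_pred == 1) & (y_true == 1)).sum(axis=0)
fn = ((y_pred == 0) & (y_true == 1)).sum(axis=0)

# TPR = TP / (TP + FN); classes without any positive sample stay nan.
tpr_at_threshold = np.divide(tp, tp + fn,
                             out=np.full(tp.shape, np.nan),
                             where=(tp + fn) > 0)
```

Classes with no positive sample keep a nan rate here, which mirrors the nan entries in roc_auc.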

We can now plot the ROC curve from the TPR and FPR of a given class (its index is stored in classe here; some classes have no curve because we work on a toy dataset):

import matplotlib.pyplot as plt


plt.figure()
lw = 2
classe = 7
plt.plot(fpr[classe], tpr[classe], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[classe])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()
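
Coming back to the rule described in the question (a true label counts as a True Positive when it is among the highest probabilities), one possible reading for the multi-label case is: for an image with k true labels, check which of them fall in the k highest scores. This is only a sketch under that assumption, on a made-up frame with a shortened class list:

```python
import pandas as pd

# Hypothetical, shortened class list and scores; 'class' holds the true labels per image.
classes = ["Cardiomegaly", "Edema", "Infiltration", "Mass"]
toy = pd.DataFrame({
    "class": [["Cardiomegaly", "Edema"], ["Mass"]],
    "Cardiomegaly": [0.80, 0.10],
    "Edema": [0.20, 0.30],
    "Infiltration": [0.60, 0.70],
    "Mass": [0.10, 0.90],
})

def top_k_hits(row):
    """Map each true label to True if it is among the k highest scores (k = number of true labels)."""
    true_labels = [c for c in row["class"] if c in classes]
    top_k = row[classes].astype(float).nlargest(len(true_labels)).index
    return {label: label in top_k for label in true_labels}

# One dict of per-label flags per image.
hits = [top_k_hits(row) for _, row in toy.iterrows()]

# True only when every true label of the image is in its top k.
toy["all_true_in_top_k"] = [len(h) > 0 and all(h.values()) for h in hits]
```

Labels without a score column (such as 'No Finding') are skipped by this sketch, and the per-label flags in hits can be summed per class to obtain True Positive counts under this rule.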