python pandas dataframe model prediction

How to find only the True Positives in a DataFrame when the ground truth is available?


First of all, sorry for the long description, but I want everyone to understand my problem and what I am doing.

I am working on a detection model that predicts 14 different pathologies, and I have written an inference script that runs predictions on any new test images. The test set contains more than 25k images; I have already computed their predictions and saved them to a file that I load as a DataFrame.

Each row of this DataFrame contains the following (a little information to help understand my scenario):

image_name______00000003_000.png
label_____[[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0], [119.195767195767, 339.166137566138, 470.281481481481, 511.458201058202]], ['Cardiomegaly', 'Edema', 'Infiltration']]
Bounding_Box_____True/False
Atelectasis _____0.172639399766922
Cardiomegaly _____0.064461663365364
Consolidation _____0.436323910951614
Edema _____0.152604594826698
Effusion _____0.077432356774807
Emphysema _____0.569778263568878
Fibrosis _____0.333310723304749
Hernia _____0.219542726874351
Infiltration _____0.240452200174332
Mass _____0.291741400957108
Nodule _____0.076222963631153
Pleural_Thickening_____ 0.294208467006683
Pneumonia _____0.281939893960953
Pneumothorax _____0.386653006076813

What I want: this could be done in two ways. One is to process each class separately, e.g. first find all rows whose label contains Cardiomegaly, whether alone or together with other labels.

Then apply the operation below, or use whatever method you think is best to find the TP.

Suppose an image has the ground truth ['Cardiomegaly', 'Edema', 'Infiltration'] and probabilities for the 14 pathologies. I want to count a True Positive when these actual labels have the highest probability values.

For example, if Cardiomegaly has the highest probability, then create a new column and set it to True. I don't know how to handle the multi-label case: after checking the first label, what should I do for the second label, and how should I handle it if its probability is the highest? My last attempt was made with the help of @tlentali (thanks for helping me out). Here is what I have done:

import pandas as pd

df = pd.read_csv('/home/ali/Desktop/CX/sample.csv')
df["best_score"] = df.drop(['file', 'set', 'label', 'bbx'], axis=1).idxmax(axis=1)
df['evaluation'] = df.apply(lambda x: x["best_score"] in x["label"], axis=1)
df.groupby('best_score')['evaluation'].mean()

which gives me something like:

best_score
Atelectasis           0.452465
Cardiomegaly          0.250000
Consolidation         0.123164
Edema                 0.029520
Effusion              0.555459
Emphysema             0.068618
Fibrosis              0.066116
Hernia                0.032258
Infiltration          0.400000
Mass                  0.177524
Nodule                0.604167
Pleural_Thickening    0.188482
Pneumonia             0.049133
Pneumothorax          0.108156
Name: evaluation, dtype: float64

This is not what I want, and it only handles a single label, not multiple labels. Please help me out, and sorry again for the long description; I just want everyone to understand what I need. Thank you.


Solution

  • Starting from your DataFrame:

    >>> import pandas as pd
    
    >>> df
                    file    set     label                                        bbx    Atelectasis Cardiomegaly    Consolidation   Edema   Effusion    Emphysema   Fibrosis    Hernia  Infiltration    Mass    Nodule  Pleural_Thickening  Pneumonia   Pneumothorax
    0   00000003_000.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.145712    0.028958    0.205006    0.055228    0.115680    0.376638    0.349124    0.357694    0.122496    0.202218    0.075018    0.118994    0.195345    0.215577
    1   00000003_001.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.132639    0.046136    0.169713    0.092743    0.285383    0.614464    0.311035    0.344040    0.117032    0.447748    0.152327    0.094364    0.174125    0.316022
    2   00000003_002.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.233026    0.042541    0.227911    0.047988    0.116835    0.595102    0.330304    0.367272    0.117985    0.298624    0.109354    0.133473    0.185444    0.379627
    3   00000003_003.png    Test    [[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024....   False   0.298693    0.022646    0.237977    0.035348    0.143645    0.487804    0.384509    0.379062    0.083205    0.625744    0.102377    0.207353    0.184517    0.354402
    4   00000003_004.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.522152    0.052897    0.237475    0.082139    0.200029    0.473421    0.377468    0.336104    0.106339    0.488078    0.088047    0.146686    0.200919    0.313684
    

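    First, we parse the label column (stored as a string) in order to extract the classes we expect to predict. ast.literal_eval is used here rather than eval because it only evaluates Python literals:

    >>> import ast
    >>> df['label'] = df['label'].apply(ast.literal_eval)
    >>> df['class'] = df.label.apply(lambda x: x[1])
    >>> df['class']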
    0                                              [Hernia]
    1                                              [Hernia]
    2                                              [Hernia]
    3                                [Hernia, Infiltration]
    4                                              [Hernia]
    5                                              [Hernia]
    6                                              [Hernia]
    7                                              [Hernia]
    8                                          [No Finding]
    9                             [Emphysema, Pneumothorax]
    10                            [Emphysema, Pneumothorax]
    11                                 [Pleural_Thickening]
    12    [Effusion, Emphysema, Infiltration, Pneumothorax]
    13    [Emphysema, Infiltration, Pleural_Thickening, ...
    14                             [Effusion, Infiltration]
    15                                       [Infiltration]
    Name: class, dtype: object
    

    Then, we explode the class column to get one expected class per row:

    >>> df = df.explode('class')
    >>> df = df.reset_index(drop=True)
    >>> df['class']
    0                 Hernia
    1                 Hernia
    2                 Hernia
    3                 Hernia
    4           Infiltration
    5                 Hernia
    6                 Hernia
    7                 Hernia
    8                 Hernia
    9             No Finding
    10             Emphysema
    11          Pneumothorax
    12             Emphysema
    13          Pneumothorax
    14    Pleural_Thickening
    15              Effusion
    16             Emphysema
    17          Infiltration
    18          Pneumothorax
    19             Emphysema
    20          Infiltration
    21    Pleural_Thickening
    22          Pneumothorax
    23              Effusion
    24          Infiltration
    25          Infiltration
    Name: class, dtype: object
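
To make the effect of explode concrete, here is a minimal sketch on a made-up two-image frame (the file names and scores are hypothetical, not taken from the question). It shows that an image with two ground-truth classes becomes two rows and that its score columns are copied onto both of them.

```python
import pandas as pd

# Hypothetical mini-frame: the second image has two ground-truth classes.
toy = pd.DataFrame({
    "file": ["a.png", "b.png"],
    "class": [["Hernia"], ["Hernia", "Infiltration"]],
    "Hernia": [0.36, 0.38],
    "Infiltration": [0.12, 0.08],
})

# One row per (image, expected class); the other columns are duplicated.
exploded = toy.explode("class").reset_index(drop=True)
print(exploded)
```

Because the scores are duplicated, a multi-label image contributes one row per true label to everything computed afterwards, which is worth keeping in mind when reading per-class metrics.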
    

    Then, we convert the data to dummy (one-hot) format:

    >>> classes = ['Atelectasis', 
    ...            'Cardiomegaly',
    ...            'Consolidation', 
    ...            'Edema', 
    ...            'Effusion', 
    ...            'Emphysema', 
    ...            'Fibrosis', 
    ...            'Hernia',
    ...            'Infiltration', 
    ...            'Mass', 
    ...            'Nodule', 
    ...            'Pleural_Thickening', 
    ...            'Pneumonia',
    ...            'Pneumothorax',
    ...            'No Finding']
    >>> s = df['class']
    >>> # sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
    >>> df_classes = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
    >>> df_classes.head()
        Effusion    Emphysema   Hernia  Infiltration    No Finding  Pleural_Thickening  Pneumothorax
    0   0           0           1       0               0           0                   0
    1   0           0           1       0               0           0                   0
    2   0           0           1       0               0           0                   0
    3   0           0           1       0               0           0                   0
    4   0           0           0       1               0           0                   0
    

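    As we are currently working on a toy dataset, we have to make some adjustments so that every desired class is available in dummy format. The classes absent from the toy labels get an all-zero column in df_classes, and since the model outputs no score for 'No Finding', that column is added to df with a score of 0: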

    >>> df_classes['Atelectasis'] = 0 
    >>> df_classes['Cardiomegaly'] = 0 
    >>> df_classes['Consolidation'] = 0 
    >>> df_classes['Edema'] = 0 
    >>> df_classes['Fibrosis'] = 0 
    >>> df_classes['Mass'] = 0 
    >>> df_classes['Nodule'] = 0 
    >>> df_classes['Pneumonia'] = 0 
    >>> df['No Finding'] = 0
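
As an alternative to adding the missing columns one by one, the one-hot frame can be aligned to the full class list in a single step. The following is only a sketch on a shortened, made-up class list; it relies on DataFrame.reindex with fill_value to create the absent classes as all-zero columns.

```python
import pandas as pd

# Shortened class list and labels, for illustration only.
classes = ["Atelectasis", "Hernia", "Infiltration", "No Finding"]
s = pd.Series(["Hernia", "Infiltration", "No Finding", "Hernia"], name="class")

# reindex adds every class missing from the data as an all-zero column
# and orders the columns exactly like `classes`.
onehot = pd.get_dummies(s).reindex(columns=classes, fill_value=0).astype(int)
print(onehot)
```

The 'No Finding' score column still has to be added to the score frame separately, because the model does not predict it.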
    

    Now, we can use sklearn to get our TPR and eventually the AUC:

    from sklearn.metrics import roc_curve, auc
    
    
    n_classes = len(classes)
    y_test = df_classes[classes].to_numpy()
    y_score = df[classes].to_numpy()
    
    # Compute ROC curve and ROC area for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
    # Compute micro-average ROC curve and ROC area
    fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
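
roc_curve reports the TPR at every possible threshold. If a plain true-positive count per class at one fixed threshold is wanted instead, it can be computed directly with NumPy. This is a minimal sketch: the arrays are made up, and the 0.5 threshold is an assumption, not something stated in the question.

```python
import numpy as np

# Hypothetical ground truth (rows = samples, columns = classes) and scores.
y_true = np.array([[1, 0],
                   [0, 1],
                   [1, 1],
                   [0, 0]])
y_prob = np.array([[0.9, 0.2],
                   [0.4, 0.7],
                   [0.3, 0.6],
                   [0.8, 0.1]])

threshold = 0.5  # assumed cut-off
y_pred = y_prob >= threshold

tp = (y_pred & (y_true == 1)).sum(axis=0)   # true positives per class
fn = (~y_pred & (y_true == 1)).sum(axis=0)  # false negatives per class
tpr_at_threshold = tp / (tp + fn)           # recall per class at this threshold
print(tp, tpr_at_threshold)
```

With the real data, y_true and y_prob would be y_test and y_score from above; a class with no positive sample gives a division by zero here, just as it gives nan in the ROC computation.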
    

    Now we can have a look at the roc_auc values. The nan values appear because not all classes are present in the ground truth of the toy dataset:

    >>> roc_auc
    {0: nan,
     1: nan,
     2: nan,
     3: nan,
     4: 0.3125,
     5: 0.7613636363636364,
     6: nan,
     7: 0.9479166666666666,
     8: 0.6190476190476191,
     9: nan,
     10: nan,
     11: 0.30208333333333337,
     12: nan,
     13: 0.7840909090909091,
     14: 0.5,
     'micro': 0.66562764158918}
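
The integer keys follow the order of the classes list, so the dictionary is easier to read once it is re-keyed by class name. A small sketch, using a shortened class list and made-up AUC values rather than the results above:

```python
import math

# Shortened class list and invented values, shaped like the dictionary above.
classes = ["Atelectasis", "Effusion", "Hernia"]
roc_auc = {0: float("nan"), 1: 0.25, 2: 0.75, "micro": 0.5}

# Map each integer index to its class name, leaving the "micro" entry aside.
named = {classes[k]: v for k, v in roc_auc.items() if isinstance(k, int)}

# Keep only the classes for which an AUC could actually be computed.
valid = {name: v for name, v in named.items() if not math.isnan(v)}
print(valid)
```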
    

    We can now plot the ROC curve from the TPR and FPR of each class (the class index is called classe here; some classes are empty because we work on a toy dataset):

    import matplotlib.pyplot as plt
    
    
    plt.figure()
    lw = 2
    classe = 7
    plt.plot(fpr[classe], tpr[classe], color='darkorange',
             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[classe])
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc="lower right")
    plt.show()
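
If a per-label True/False flag is preferred over ROC curves, one possible reading of the question is a top-k rule: for an image with k ground-truth labels, a label counts as a true positive when it is among the k highest-scoring classes. This rule is an assumption made for illustration, not something the question confirms, and the frame, scores and helper name flag_true_labels below are hypothetical.

```python
import pandas as pd

# Shortened class list and invented scores, for illustration only.
classes = ["Cardiomegaly", "Edema", "Emphysema", "Infiltration"]
toy = pd.DataFrame({
    "file": ["a.png", "b.png"],
    "class": [["Cardiomegaly", "Edema"], ["Infiltration"]],
    "Cardiomegaly": [0.70, 0.10],
    "Edema": [0.20, 0.30],
    "Emphysema": [0.60, 0.50],
    "Infiltration": [0.10, 0.40],
})

def flag_true_labels(row):
    """Return {label: bool}: is each true label among the k top-scoring classes?"""
    true_labels = row["class"]
    # k = number of ground-truth labels for this image
    top_k = row[classes].astype(float).nlargest(len(true_labels)).index
    return {label: label in top_k for label in true_labels}

toy["hits"] = [flag_true_labels(row) for _, row in toy.iterrows()]
print(toy[["file", "hits"]])
```

For a.png the two highest scores are Cardiomegaly and Emphysema, so Cardiomegaly is flagged True and Edema False; for b.png the single highest score is Emphysema, so Infiltration is flagged False.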