Search code examples
pythonpandasfunctiondataframekeyerror

Assigning values to new column based on multiple string conditions


What I have:

|    ID   |   Possible_Size    |   Actual_Size     |
|:------: |:------------------:|:-----------------:|  
|   1234  |         BIG        |        BIG        |
|   5678  |       MEDIUM       |        BIG        |
|   9876  |        SMALL       |       SMALL       |       
|   1092  |       MEDIUM       |       MEDIUM      |

What I would like to create:

|    ID   |   Possible_Size    |   Actual_Size     |       Big       |
|:------: |:------------------:|:-----------------:|:---------------:|  
|   1234  |         BIG        |        BIG        |  True Positive  |
|   5678  |       MEDIUM       |        BIG        |  False Negative |  
|   9876  |        BIG         |       SMALL       |  False Positive |   
|   1092  |       MEDIUM       |       MEDIUM      |                 |

What I have tried:

    def sizes(row):
                        
        if row['Actual_Size'] in ['BIG'] and row['Possible_Size'] in ['BIG']:
            df['Big'] = 'True Positive'
        elif row['Actual_Size'] in ['BIG'] and row['Possible_Size'] in ['MEDIUM', 'SMALL']:
            df['Big'] = 'False Negative'
        elif row['Actual_Size'] in ['MEDIUM', 'SMALL'] and row['Possible_Size'] in ['BIG']:
            df['Big'] = 'False Positive'  
        else:
            df['Big'] = ''
                        
    df.apply(sizes, axis=1)

Currently I am getting a blank 'Big' column


Solution

  • For this multiple if/elif statements you could use np.select:

    choices = ['True Positive','False Negative','False Positive']
    conditions = [
           ((df['Actual_Size'].isin(['BIG']))&(df['Possible_Size'].isin(['BIG']))), 
           ((df['Actual_Size'].isin(['BIG']))&(df['Possible_Size'].isin(['MEDIUM', 'SMALL']))),
           ((df['Actual_Size'].isin(['MEDIUM', 'SMALL']))&(df['Possible_Size'].isin(['BIG'])))]
    import numpy as np
    df['Big'] = np.select(conditions, choices, default='')
    

    If you want to keep your original solution, the problem was that you were returning nothing when applying the function row by row, so you could try this:

    def sizes(row):
    
        if row['Actual_Size'] in ['BIG'] and row['Possible_Size'] in ['BIG']:
            return'True Positive'
        elif row['Actual_Size'] in ['BIG'] and row['Possible_Size'] in ['MEDIUM', 'SMALL']:
            return 'False Negative'
        elif row['Actual_Size'] in ['MEDIUM', 'SMALL'] and row['Possible_Size'] in ['BIG']:
            return 'False Positive'  
        else:
            return ''
    
    df['Big']=df.apply(sizes, axis=1)
    

    Both outputs:

    df
         ID Possible_Size Actual_Size             Big
    0  1234           BIG         BIG   True Positive
    1  5678        MEDIUM         BIG  False Negative
    2  9876           BIG       SMALL  False Positive
    3  1092        MEDIUM      MEDIUM