Search code examples
pythonpandasdataframefunctionlambda

Use fields of one dataframe as conditions to fill a field of another dataframe


I have 2 dataframes, the first is a small dataframe (df1) with information to use to fill a field (named Flag) of the second dataframe (df2). I need to write a function that uses each row of df1 as parameters to fill each row of df2 with a certain value (Y or N).

df1 =

type q25 q75
A 13 98
B 381 500
C 34 103

df2 =

field1 field2 ... TYPE delta Flag
field1 field2 ... A 379 Y
field1 field2 ... C 90 N
field1 field2 ... A 50 N
field1 field2 ... B 2000 Y

I tried this code, but unfortunately it doesn't work because it overwrites the lines in df2 related to the cases of df1 with type equal to A or B, and it works for the last record of df2 (type equal to C)

def filling(x, row):
    IQR = row['q75'] - row['q25']
    if (x['Flag'] != 'Y'):
        if row['type'] == x['TYPE'] and (x['delta'] < row['q25'] - 1.5*IQR or x['delta'] > row['q75'] + 1.5*IQR):  
            return 'Y'
        else:
            return 'N'

for index, row in df2.iterrows():
    df1['Flag'] = df1.apply(lambda x : filling(x, row), axis=1) 

How could I fix it?


Solution

  • From what I understand you would like to have a flag column which tell you whether the particular row is an outlier or not. Here is a vectorized and concise way to achieve that:

    # Merge the dataframes on type column
    s = df2.merge(df1, left_on='TYPE', right_on='type', how='left')
    
    # calculate IQR and condition to check for outlier
    s['IQR'] = s['q75'] - s['q25']
    is_outlier = ~s['delta'].between(s['q25'] - 1.5 * s['IQR'], s['q25'] + 1.5 * s['IQR'])
    
    # Use np.where to select Y/N based on the outlier condition
    s['Flag'] = np.where(s['Flag'].ne('Y') & is_outlier, 'Y', s['Flag'])
    
    # drop the columns from df1
    s = s.drop(columns=df1.columns)
    

    Result

    print(s)
    
       field1  field2  ... TYPE  delta Flag  IQR
    0  field1  field2  ...    A    379    Y   85
    1  field1  field2  ...    C     90    N   69
    2  field1  field2  ...    A     50    N   85
    3  field1  field2  ...    B   2000    Y  119