Search code examples
pandasdataframecomparisonmultiple-columns

Compare multiple columns in the dataframe and return column names with different values in a new column


Consider this df:

data = [{'name': 'Muhammad', 'age_x': 20, 'city_x': 'Karachi', 'age_y': 20, 'city_y': 'Karachi'},
        {'name': 'Ali', 'age_x': 19, 'city_x': 'Lahore', 'age_y': 30, 'city_y': None},
        {'name': 'Ahmed', 'age_x': 25, 'city_x': 'Islamabad', 'age_y': None, 'city_y': 'Islamabad'}]
            
df = pd.DataFrame(data)
cols = {'age_x':'age_y', 'city_x':'city_y'}

The idea is to compare multiple column pairs (age_x and age_y, city_x and city_y in this example) and return the column name(s) if the value is different. There are many columns to check that so would be good to use the cols dictionary in the solution. The column names to return are the ones with '_y'. So the expected result is:

    name     age_x  city_x    age_y city_y    diff
0   Muhammad 20     Karachi   20.0  Karachi   None
1   Ali      19     Lahore    30.0  None      age_y, city_y
2   Ahmed    25     Islamabad NaN   Islamabad age_y

Solution

  • Use apply on axis=1 and a list comprehension derived from cols:

    df["diff"] = df.apply(
        lambda row: ", ".join([y for x, y in cols.items() if row[x] != row[y]]) or None,
        axis=1,
    )
    
           name  age_x     city_x  age_y     city_y           diff
    0  Muhammad     20    Karachi   20.0    Karachi           None
    1       Ali     19     Lahore   30.0       None  age_y, city_y
    2     Ahmed     25  Islamabad    NaN  Islamabad          age_y
    

    You can also get cols from df instead of typing it manually:

    df = df.sort_index(axis=1)  # sort columns if not already sorted
    cols = {
        x: y
        for x, y in zip(df.filter(regex=r"_x$").columns, df.filter(regex=r"_y$").columns)
    }
    
    {'age_x': 'age_y', 'city_x': 'city_y'}