Search code examples
pythonpandasdataframecompareboolean-indexing

Compare 2 DataFrames and drop rows that do not contain corresponding ID variables


I need to compare 2 DataFrames and drop rows in either that do not contain the corresponding IDs. As an example consider df1 and df2.

df1 = pd.DataFrame({'ID':[1,2,3,4],
                'Food':['Ham','Cheese','Egg','Bacon',],
                'Amount':[5,2,10,4,],
                })

df2 = pd.DataFrame({'ID':[1,2,4,5],
                'Food':['Ham','Cheese','Bacon','Chocolate Salty Balls'],
                'Amount':[6,7,15,5000],
                })

Pseudocode:

if df1['ID'] notin df2['ID']:
   df2['ID'].drop()

...and vice versa. The outcome in the example would be:

df1:

   ID    Food  Amount
0   1     Ham       5
1   2  Cheese       2
2   4   Bacon       4

df2:

   ID    Food  Amount
0   1     Ham       6
1   2  Cheese       7
2   4   Bacon      15

Solution

  • You could use a boolean mask:

    msk1 = df1['ID'].isin(df2['ID'])
    msk2 = df2['ID'].isin(df1['ID'])
    df1 = df1[msk1]
    df2 = df2[msk2]
    

    or use set.intersection and boolean indexing:

    common = set(df1['ID']).intersection(df2['ID'])
    df1 = df1[df1['ID'].isin(common)]
    df2 = df2[df2['ID'].isin(common)]
    

    Output:

    df1:

       ID    Food  Amount
    0   1     Ham       5
    1   2  Cheese       2
    3   4   Bacon       4
    

    df2:

       ID    Food  Amount
    0   1     Ham       6
    1   2  Cheese       7
    2   4   Bacon      15