Search code examples
pythonisinstance

using isin across multiple columns


I'm trying to use .isin with the ~ so I can get a list of unique rows back based on multiple columns in 2 data-sets.

So, I have 2 data-sets with 9 rows: df1 is the bottom and df2 is the top (sorry but I couldn't get it to show both below, it showed 1 then a row of numbers)

   Index    Serial  Count   Churn
     1       9         5    0
     2       8         6    0
     3       10        2    1
     4       7         4    2
     5       7         9    2
     6       10        2    2
     7       2         9    1
     8       9         8    3
     9       4         3    5


    Index   Serial  Count   Churn
     1       10      2       1
     2       10      2       1
     3       9       3       0
     4       8       6       0
     5       9       8       0
     6       1       9       1
     7       10      3       1
     8       6       7       1
     9       4       8       0

I would like to get a list of rows from df1 which aren't in df2 based on more than 1 column.

For example if I base my search on the columns Serial and Count I wouldn't get Index 1 and 2 back from df1 as it appears in df2 at Index position 6, the same with Index position 4 in df1 as it appears at Index position 2 in df2. The same would apply to Index position 5 in df1 as it is at Index position 8 in df2.

The churn column doesn't really matter.

I can get it to work but based only on 1 column but not on more than 1 column.

df2[~df2.Serial.isin(df1.Serial.values)] kinda does what I want, but only on 1 column. I want it to be based on 2 or more.

  Index Serial  Count   Churn
   3    9          3    0
   6    1          9    1
   7    10         3    1
   8    6          7    1
   9    4          8    0

Solution

  • One solution is to merge with indicators:

    df1 = pd.DataFrame([[10, 2, 0], [9, 4, 1], [9, 8, 1], [8, 6, 1], [9, 8, 1], [1, 9, 1], [10, 3, 1], [6, 7, 1], [4, 8, 1]], columns=['Serial', 'Count', 'Churn'])
    df2 = pd.DataFrame([[9, 5, 1], [8, 6, 1], [10, 2, 1], [7, 4, 1], [7, 9, 1], [10, 2, 1], [2, 9, 1], [9, 8, 1], [4, 3, 1]], columns=['Serial', 'Count', 'Churn'])
    # merge with indicator on
    df_temp = df1.merge(df2[['Serial', 'Count']].drop_duplicates(), on=['Serial', 'Count'], how='left', indicator=True)
    res = df_temp.loc[df_temp['_merge'] == 'left_only'].drop('_merge', axis=1)
    
    Output        
       Serial  Count  Churn
    1       9      4      1
    5       1      9      1
    6      10      3      1
    7       6      7      1
    8       4      8      1