Search code examples
pandascomparisonunique

efficient way of getting unique values by comparing columns in two dataframes pandas


I have two dataframes, something like below:

  df1:

   date             col1          col2         col3
 15-5-2022          ABC            1            PQR
 16-5-2022          BCD            2            ABC
 17-5-2022          CDE            4            XYZ


  df2:

   date           col1          col2         col3
 5-4-2022          XYZ            1           ABC
 6-4-2022          PQR            2           ABC
 7-4-2022          BCD            4           PQR

My task is to get total number of unique values that are in df2.col1 but not in df1.col1. The way I am doing this is by creating first a list of all col1 unique values from df1 and then from df2 and then comparing these two lists and creating a third list with what exists in second list but not the first. Since I need the count of items in the final list, I am doing a len on third list. My code is like below:

list1 = df1.col1.unique()    
list2 = df2.col1.unique()
list3 = [x for x in list2 if x not in list1]
num_list3 = len(list3)
 

This is getting my task done, but taking a very long time to run, probably because my dfs are quite big. I was wondering if there is a smarter and more efficient way of doing this please. I would appreciate any help


Solution

  • Use:

    df2.loc[~df2['col1'].isin(df1['col1']), 'col1'].unique()
    

    output: array(['XYZ', 'PQR'], dtype=object)

    Or, with sets:

    set(df2['col1']) - set(df1['col1'])
    

    output: {'PQR', 'XYZ'}