I have two dataframes, something like below:
df1:
date col1 col2 col3
15-5-2022 ABC 1 PQR
16-5-2022 BCD 2 ABC
17-5-2022 CDE 4 XYZ
df2:
date col1 col2 col3
5-4-2022 XYZ 1 ABC
6-4-2022 PQR 2 ABC
7-4-2022 BCD 4 PQR
My task is to get total number of unique values that are in df2.col1 but not in df1.col1. The way I am doing this is by creating first a list of all col1 unique values from df1 and then from df2 and then comparing these two lists and creating a third list with what exists in second list but not the first. Since I need the count of items in the final list, I am doing a len on third list. My code is like below:
list1 = df1.col1.unique()
list2 = df2.col1.unique()
list3 = [x for x in list2 if x not in list1]
num_list3 = len(list3)
This is getting my task done, but taking a very long time to run, probably because my dfs are quite big. I was wondering if there is a smarter and more efficient way of doing this please. I would appreciate any help
Use:
df2.loc[~df2['col1'].isin(df1['col1']), 'col1'].unique()
output: array(['XYZ', 'PQR'], dtype=object)
Or, with sets:
set(df2['col1']) - set(df1['col1'])
output: {'PQR', 'XYZ'}