I have the following two dataframes:
df2
has 2166 unique records.
I am trying to return the data in the blue area so I am using the following technique:
To get the green data I simply perform an inner join:
df_common = df1.merge(df2, how='inner' on='SKU')
which yields 413 unique records
To get the blue area, I appended the green data to df2
:
df2_plus_green = df2.append(df_common, ignore_index = True)
which yields 2579 records (which makes sense (413 + 2166 = 2579)
Then I dropped the duplicates:
df_blue = df2_plus_green.drop_duplicates(keep=False, subset=['SKU'])
which yields 1666 records. I would have thought it would yield 1753 records (2166 - 413 = 1753)
I have tried the above technique on a much smaller set of data and it works as expected. I assume the extra 87 records that are being dropped are duplicates somehow but I have checked each dataframe at each step along the way and all dataframes are made up of completely unique records. Can anyone point me in the right direction? I am sure it is something obvious that I'm missing.
While you're merging both the tables, use the Parameter "Indicator" which will tell us how each and every data point is joined.
df_right_only = pd.merge(df1, df2, on = "Common Column", how = "right", indicator = True)
Then we can use LOC to filter out only the values that are pulled uniquely from the right table.
df_right_only.loc[df_right_only ['_merge'] == "right_only", 'Column that you want']
This approach is known as Anti-Joins.