Search code examples
pythonpandasdataframeinner-join

Opposite of left outer join in pandas based on one column results in less records than expected


I have the following two dataframes:

dataframes venn diagram

df2 has 2166 unique records.

I am trying to return the data in the blue area so I am using the following technique:

To get the green data I simply perform an inner join:
df_common = df1.merge(df2, how='inner' on='SKU')
which yields 413 unique records

To get the blue area, I appended the green data to df2:
df2_plus_green = df2.append(df_common, ignore_index = True)
which yields 2579 records (which makes sense (413 + 2166 = 2579)

Then I dropped the duplicates:
df_blue = df2_plus_green.drop_duplicates(keep=False, subset=['SKU'])
which yields 1666 records. I would have thought it would yield 1753 records (2166 - 413 = 1753)

I have tried the above technique on a much smaller set of data and it works as expected. I assume the extra 87 records that are being dropped are duplicates somehow but I have checked each dataframe at each step along the way and all dataframes are made up of completely unique records. Can anyone point me in the right direction? I am sure it is something obvious that I'm missing.


Solution

  • While you're merging both the tables, use the Parameter "Indicator" which will tell us how each and every data point is joined.

    df_right_only = pd.merge(df1, df2, on = "Common Column", how = "right", indicator = True)
    

    Then we can use LOC to filter out only the values that are pulled uniquely from the right table.

    df_right_only.loc[df_right_only ['_merge'] == "right_only", 'Column that you want']
    

    This approach is known as Anti-Joins.