Search code examples
pythonpandasdataframesubsetset-difference

How can I find the "set difference" of rows in two dataframes on a subset of columns in Pandas?


I have two dataframes, say df1 and df2, with the same column names.

Example:

df1

C1 | C2 | C3 | C4
A    1    2    AA
B    1    3    A
A    3    2    B

df2

C1 | C2 | C3 | C4
A    1    3    E
B    1    2    C
Q    4    1    Z

I would like to filter out rows in df1 based on common values in a fixed subset of columns between df1 and df2. In the above example, if the columns are C1 and C2, I would like the first two rows to be filtered out, as their values in both df1 and df2 for these columns are identical.

What would be a clean way to do this in Pandas?

So far, based on this answer, I have been able to find the common rows.

common_df = pandas.merge(df1, df2, how='inner', on=['C1','C2'])

This gives me a new dataframe with only those rows that have common values in the specified columns, i.e., the intersection.

I have also seen this thread, but the answers all seem to assume a difference on all the columns.

The expected result for the above example (rows common on specified columns removed):

C1 | C2 | C3 | C4
A    3    2    B

Solution

  • Maybe not the cleanest, but you could add a key column to df1 to check against.

    Setting up the datasets

    import pandas as pd
    df1 = pd.DataFrame({ 'C1': ['A', 'B', 'A'],
                'C2': [1, 1, 3],
                'C3': [2, 3, 2],
                'C4': ['AA', 'A', 'B']})
    df2 = pd.DataFrame({ 'C1': ['A', 'B', 'Q'],
                'C2': [1, 1, 4],
                'C3': [3, 2, 1],
                'C4': ['E', 'C', 'Z']})
    

    Adding a key, using your code to find the commons

    df1['key'] = range(1, len(df1) + 1)
    common_df = pd.merge(df1, df2, how='inner', on=['C1','C2'])
    df_filter = df1[~df1['key'].isin(common_df['key'])].drop('key', axis=1)