Search code examples
pythonpandasequality

How to compare two dataframes and filter rows and columns where a difference is found


I am testing dataframes for equality.

df_diff=(df1!=df2)

I get df_diff which is same shape as df*, and contains boolean True/False.

Now I would like to keep only the columns and rows of df1 where there was at least a different value. If I simply do

   df1=[df_diff.values]

I get all the rows where there was at least one True in df_diff, but lots of columns originally had False only.

As a second step, I would like then to be able to replace all the values (element-wise in the dataframe) which were equal (where df_diff==False) with NaNs.

example:

df1=pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]])
df2=pd.DataFrame(data=[[1,99,3],[4,5,99],[7,8,9]])

I would like to get from df1

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

to

    1  2
0  2  NaN  
1  NaN  6

Solution

  • I think you need DataFrame.any for check at least one True per rows of columns:

    df = df_diff[df_diff.any(axis=1)]
    

    It is possible to filter both of the original dataframes like so:

    df11 = df1[df_diff.any(axis=1)]
    df22 = df2[df_diff.any(axis=1)]
    

    If want all columns and rows:

    df = df_diff.loc[df_diff.any(axis=1), df_diff.any()]
    

    EDIT: Filter d1 and add NaNs by where:

    df_diff=(df1!=df2)
    
    m1 = df_diff.any(axis=1)
    m2 = df_diff.any()
    
    out = df1.loc[m1, m2].where(df_diff.loc[m1, m2])
    print (out)
         1    2
    0  2.0  NaN
    1  NaN  6.0