Search code examples
pythonpandasnumpy

Comparing dataframes


The goal is to compare two pandas dataframes considering a margin of error.

To reproduce the issue:

Importing pandas

import pandas as pd

Case one - same data dataframes

df1 = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [3,3,3]})
df2 = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [3,3,3]})
print(df1.compare(df2, result_names=('df1', 'df2')))

# The result is an empty dataframe
Empty DataFrame
Columns: []
Index: []

Case two - different data dataframes

df1 = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [3,3,3]})
df2 = pd.DataFrame({"A": [1,1,1], "B": [2,2.2,2], "C": [3,3,3]}) # Note that the second B value is 2.2
print(df1.compare(df2, result_names=('df1', 'df2')))

# The result is a dataframe showing differences
     B     
   df1  df2
1  2.0  2.2

The issue is that I want that it only considers differences greater than 0.5

How I achieved it.

threshold = 0.5
df3 = df1.melt().reset_index().merge(df2.melt().reset_index(), on="index")
df3["diff"] = (df3["value_x"] - df3["value_y"]).abs()
print(df3.loc[df3["diff"] > threshold])

# The result is an empty dataframe
Empty DataFrame
Columns: [index, variable_x, value_x, variable_y, value_y, diff]
Index: []

Is there a better way to do this? It takes a lot of time for a huge DF.

In time:

  • This is only a reproducible example.
  • I am opened to use other libraries as Numpy.

Solution

  • Depending on your ultimate goal, assert_frame_equal with the atol parameter may work.

    from pandas.testing import assert_frame_equal
    
    # specify dtypes for the reproducible example
    # otherwise assert_frame_equal flags different dtypes (int vs. float)
    df1 = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [3,3,3]}, dtype=float)
    df2 = pd.DataFrame({"A": [1,1,1], "B": [2,2.2,2], "C": [3,3,3]}, dtype=float)
    
    assert_frame_equal(df1, df2, atol=0.5)