The goal is to compare two pandas dataframes considering a margin of error.
To reproduce the issue:
import pandas as pd
df1 = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [3,3,3]})
df2 = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [3,3,3]})
print(df1.compare(df2, result_names=('df1', 'df2')))
# The result is an empty dataframe
Empty DataFrame
Columns: []
Index: []
df1 = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [3,3,3]})
df2 = pd.DataFrame({"A": [1,1,1], "B": [2,2.2,2], "C": [3,3,3]}) # Note that the second B value is 2.2
print(df1.compare(df2, result_names=('df1', 'df2')))
# The result is a dataframe showing differences
B
df1 df2
1 2.0 2.2
The issue is that I want that it only considers differences greater than 0.5
threshold = 0.5
df3 = df1.melt().reset_index().merge(df2.melt().reset_index(), on="index")
df3["diff"] = (df3["value_x"] - df3["value_y"]).abs()
print(df3.loc[df3["diff"] > threshold])
# The result is an empty dataframe
Empty DataFrame
Columns: [index, variable_x, value_x, variable_y, value_y, diff]
Index: []
Is there a better way to do this? It takes a lot of time for a huge DF.
In time:
Depending on your ultimate goal, assert_frame_equal
with the atol
parameter may work.
from pandas.testing import assert_frame_equal
# specify dtypes for the reproducible example
# otherwise assert_frame_equal flags different dtypes (int vs. float)
df1 = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": [3,3,3]}, dtype=float)
df2 = pd.DataFrame({"A": [1,1,1], "B": [2,2.2,2], "C": [3,3,3]}, dtype=float)
assert_frame_equal(df1, df2, atol=0.5)