I have a function, extract_redundant_values, to extract redundant rows from a pandas dataframe. I am testing it by running it on in_df to generate out_df. I am then comparing this against my expected output, expected_out_df. They seem to have the same index, columns and values, but do not qualify as equal according to pd.DataFrame.equals():
import numpy as np
import pandas as pd
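def extract_redundant_values(df, col):
    # Rows whose value in `col` occurs exactly once
    unique_df = df.drop_duplicates(subset=[col], keep=False)
    # Every remaining row has a value in `col` that occurs more than once
    redundant_df = df[~df.index.isin(unique_df.index)]
    return redundant_df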
# =============================================================================
# setup
# =============================================================================
expected_columns = ['Col1', 'Col2', 'Col3']
in_df = pd.DataFrame(data=[[1, 2, 3],
                           [4, 6, 6],
                           [7, 8, 9]],
                     columns=expected_columns)
# =============================================================================
# run
# =============================================================================
out_df = extract_redundant_values(df=in_df, col="Col1")
# =============================================================================
# compare
# =============================================================================
expected_out_df = pd.DataFrame(columns = expected_columns)
# same columns, index and values
assert out_df.columns.equals(expected_out_df.columns) #fine
assert out_df.index.equals(expected_out_df.index) #fine
assert np.array_equal(expected_out_df.values, out_df.values) #fine
#not the same for some other reason...
assert out_df.equals(expected_out_df) #assertion error
I have also tried comparing two empty dataframes with the same columns, and these were fine as expected, so I don't see why out_df and expected_out_df are considered different:
expected_columns = ['Col1', 'Col2', 'Col3']
eg_df1 = pd.DataFrame(columns = expected_columns)
eg_df2 = pd.DataFrame(columns = expected_columns)
assert eg_df1.equals(eg_df2) #fine
Can anyone offer an explanation?
Thanks!
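Your expected_out_df dataframe has columns of dtype object, whereas your out_df has columns of dtype np.int64, and this difference is significant for .equals(). A dataframe created with only columns and no data defaults to object, while out_df inherits its integer dtypes from in_df even though it has no rows.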
See this discussion: API: how strict should the equals() method be?
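As a quick way to see the dtype difference described above, here is a minimal sketch. The all-False mask is only a stand-in (an assumption for brevity) for the question's function, which also returns zero rows for this input:

```python
import pandas as pd

cols = ["Col1", "Col2", "Col3"]
in_df = pd.DataFrame([[1, 2, 3], [4, 6, 6], [7, 8, 9]], columns=cols)

# Filtering with an all-False mask removes every row but keeps the column dtypes
out_df = in_df[in_df["Col1"] < 0]

# A frame built from column names alone has no data to infer dtypes from
expected_out_df = pd.DataFrame(columns=cols)

print(out_df.dtypes)           # integer dtype for each column
print(expected_out_df.dtypes)  # object for each column
```

Printing .dtypes on both frames is usually the fastest way to diagnose an .equals() failure when the values look identical.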
To fix this, you can set the dtype of expected_out_df explicitly:
expected_out_df = pd.DataFrame(columns=expected_columns, dtype=np.int64)
Now your assertion should pass.
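If the columns do not all share one dtype, a single dtype argument is not enough. One possible approach, sketched here with a hypothetical helper named empty_like (not part of pandas), is to copy the dtypes from the input frame:

```python
import pandas as pd

def empty_like(df):
    """Hypothetical helper: a zero-row frame with df's columns and dtypes."""
    # astype accepts a {column: dtype} mapping, so each column is cast separately
    return pd.DataFrame(columns=df.columns).astype(df.dtypes.to_dict())

in_df = pd.DataFrame([[1, 2, 3], [4, 6, 6], [7, 8, 9]],
                     columns=["Col1", "Col2", "Col3"])

expected_out_df = empty_like(in_df)
print(expected_out_df.dtypes)  # should match in_df.dtypes column by column
```

This keeps the expected frame in sync with the input if the test data later gains, say, a float or string column.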