
Comparing empty dataframes


I have a function, extract_redundant_values, that extracts redundant rows from a pandas dataframe. I am testing it by running it on in_df to generate out_df, then comparing the result against my expected output, expected_out_df. The two seem to have the same index, columns and values, but they do not qualify as equal according to DataFrame.equals():

import numpy as np
import pandas as pd

def extract_redundant_values(df, col):
    unique_df = df.drop_duplicates(subset=[col],
                                   keep = False)
    redundant_df = df[~df.index.isin(unique_df.index)]
    return redundant_df

# =============================================================================
# setup
# =============================================================================
expected_columns = ['Col1', 'Col2', 'Col3']
in_df = pd.DataFrame(data = [[1, 2, 3],
                            [4, 6, 6],
                            [7, 8, 9]], 
                    columns = expected_columns)

# =============================================================================
# run 
# =============================================================================
out_df = extract_redundant_values(df = in_df,
                                  col = "Col1")

# =============================================================================
# compare
# =============================================================================
expected_out_df = pd.DataFrame(columns = expected_columns)

#same columns, index and values
assert out_df.columns.equals(expected_out_df.columns) #fine
assert out_df.index.equals(expected_out_df.index) #fine
assert np.array_equal(expected_out_df.values, out_df.values) #fine

#not the same for some other reason...
assert out_df.equals(expected_out_df) #assertion error

I have also tried comparing two empty dataframes with the same columns, and these compared equal, as expected, so I don't see why out_df and expected_out_df are considered different:

expected_columns = ['Col1', 'Col2', 'Col3']
eg_df1 = pd.DataFrame(columns = expected_columns)
eg_df2 = pd.DataFrame(columns = expected_columns)
assert eg_df1.equals(eg_df2) #fine

Can anyone offer an explanation?

Thanks!


Solution

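  • Your expected_out_df dataframe has columns of dtype object, whereas the columns of out_df are np.int64 (carried over from in_df), and this difference is significant for .equals(), which requires the dtypes to match as well as the values. Your two example frames eg_df1 and eg_df2 compare equal because both have object columns.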

    See this discussion: API: how strict should the equals() method be?

    To fix this, you can set the expected_out_df datatypes.

    expected_out_df = pd.DataFrame(columns=expected_columns, dtype=np.int64)
    

    Now your assertion should pass.
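
As a minimal sketch of the point above, the snippet below rebuilds the frames from the question, prints both sets of dtypes, and then builds the expected frame by copying the dtypes from in_df instead of hard-coding np.int64. The names naive_expected and typed_expected are hypothetical and only used for illustration, and the filtering is inlined rather than calling extract_redundant_values so that the block is self-contained.

```python
import pandas as pd

cols = ["Col1", "Col2", "Col3"]
in_df = pd.DataFrame([[1, 2, 3], [4, 6, 6], [7, 8, 9]], columns=cols)

# Same filtering as extract_redundant_values(in_df, "Col1") in the question.
unique_df = in_df.drop_duplicates(subset=["Col1"], keep=False)
out_df = in_df[~in_df.index.isin(unique_df.index)]

# An empty frame built from column names alone gets object columns.
naive_expected = pd.DataFrame(columns=cols)

print(out_df.dtypes)          # integer dtypes carried over from in_df
print(naive_expected.dtypes)  # object dtypes

# Hypothetical alternative to dtype=np.int64: reuse the input's dtypes,
# so the expected frame keeps matching if in_df's column types change.
typed_expected = naive_expected.astype(in_df.dtypes.to_dict())

print(out_df.equals(naive_expected))  # dtype mismatch
print(out_df.equals(typed_expected))  # dtypes now agree
```

Copying the dtypes from the input is a design choice rather than a requirement: it also covers frames whose columns have mixed types, where a single dtype argument to the constructor would not fit.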