spark-koalas

Comparing two koalas dataframes for testing purposes


Pandas has a testing module that includes assert_frame_equal. Does Koalas have anything similar?

I am writing tests for a whole set of transformations on Koalas dataframes. At first, since my test CSV files have only a few (<10) rows, I thought about just using pandas. Unfortunately, the files are quite wide (close to 200 columns) and have a variety of data types that are specified when Spark reads the files. Since type specification is very different in pandas than in Koalas, I would have to write out a list of ~200 dtypes in addition to the type schema we already wrote for Spark. That is why we decided it would be more efficient to use Spark and Koalas to create the dataframes for the tests. However, I can't find in the docs a way to compare two dataframes to check that the result of the transformations matches the expected dataframe we created.


Solution

  • I ended up using this:

    from pandas.testing import assert_frame_equal
    assert_frame_equal(kdf1.to_pandas(), kdf2.to_pandas())
    

    This works, and I think it is okay because the data frames are "small": to_pandas() collects all of the data onto the driver. I wonder if the reason nothing like this has been implemented natively in Koalas is that the main use of such an assertion would be in tests, and tests should use small data frames anyway.
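
The approach in this answer could be wrapped in a small test helper. The following is only a sketch: the name assert_koalas_frames_equal and its sort_by parameter are hypothetical and not part of Koalas or pandas. It assumes both inputs expose to_pandas(), as Koalas DataFrames do, and that the frames are small enough to collect. The optional sort is there because Spark does not guarantee row order, so two otherwise identical results may come back with their rows in a different order.

```python
from pandas.testing import assert_frame_equal


def assert_koalas_frames_equal(actual, expected, sort_by=None, **kwargs):
    """Hypothetical helper: collect two small Koalas frames and compare them.

    `actual` and `expected` are assumed to provide `to_pandas()`.
    Extra keyword arguments are passed through to `assert_frame_equal`.
    """
    actual_pdf = actual.to_pandas()
    expected_pdf = expected.to_pandas()

    if sort_by is not None:
        # Sort on the key column(s) and rebuild the index so that
        # differences in row order alone do not fail the comparison.
        actual_pdf = actual_pdf.sort_values(sort_by).reset_index(drop=True)
        expected_pdf = expected_pdf.sort_values(sort_by).reset_index(drop=True)

    assert_frame_equal(actual_pdf, expected_pdf, **kwargs)


# Possible usage in a test (kdf_result and kdf_expected are placeholders):
#   assert_koalas_frames_equal(kdf_result, kdf_expected, sort_by="id")
#   assert_koalas_frames_equal(kdf_result, kdf_expected, check_dtype=False)
```

Passing check_dtype=False (a real assert_frame_equal option) may help when the two frames hold the same values but were built with slightly different schemas.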