I could not find an open source tool or library to compare two parquet files. Presuming I did not overlook the obvious, is there a technical reason for this?
What would a programmer need to consider before writing a parquet diff tool?
I am using Python.
Thank you.
The easiest combination would be to use pandas together with pyarrow. Once you have both packages installed, you can use pandas.read_parquet (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_parquet.html) to load each Apache Parquet file into a pandas DataFrame and then call pandas.testing.assert_frame_equal on the two resulting DataFrames.
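A minimal sketch of that approach, assuming both files fit in memory. The helper names `diff_message` and `compare_parquet_files` are made up for illustration and are not part of pandas; only `pandas.read_parquet` and `pandas.testing.assert_frame_equal` are real library calls.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def diff_message(df1, df2):
    """Return None if the frames are equal, else pandas' description of the mismatch."""
    try:
        assert_frame_equal(df1, df2)
    except AssertionError as exc:
        # assert_frame_equal reports the first difference it encounters
        return str(exc)
    return None


def compare_parquet_files(path1, path2):
    """Load two Parquet files (hypothetical paths) and compare them as DataFrames."""
    return diff_message(pd.read_parquet(path1), pd.read_parquet(path2))
```

Calling `compare_parquet_files('file1.parquet', 'file2.parquet')` would then return either `None` or a message describing the first mismatch pandas found.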
Note that this compares the two resulting DataFrames and not the exact contents of the Parquet files. As not all Parquet types map 1:1 to pandas types, some information, such as whether a column was a Date or a DateTime, can get lost in the conversion. In return, pandas offers a really good comparison infrastructure.
Alternatively, you could utilise Apache Arrow (the pyarrow package mentioned above), read the data into a pyarrow.Table and check for equality. This method preserves the type information much better but reports less detail about the differences, if there are any:
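import pyarrow.parquet as pq
# Read both files into Arrow tables, keeping the Arrow type information
table1 = pq.read_table('file1.parquet')
table2 = pq.read_table('file2.parquet')
# equals() compares schema and data and only returns True or False
assert table1.equals(table2)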