Tags: parquet, apache-arrow

How reproducible / deterministic is the Parquet format?


I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet:

Given a data transformation F(a) = b, where F is fully deterministic and the exact same versions of the entire software stack (framework, Arrow, and Parquet libraries) are used: how likely am I to get an identical binary representation of dataframe b on different hosts every time b is saved to Parquet?

In other words, how reproducible is Parquet at the binary level? When the data is logically the same, what can cause binary differences?

  • Can there be uninitialized memory between values due to alignment?
  • Assuming all serialization settings (compression, chunking, use of dictionaries, etc.) are the same, can the result still drift? (A minimal test sketch follows below.)
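
To make the question concrete, here is roughly the kind of check I have in mind, sketched with the Rust arrow, parquet, and sha2 crates (builder method names may differ slightly between versions): write the same record batch twice with every writer setting pinned and compare the SHA-256 of the resulting bytes.

```rust
use std::sync::Arc;

use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::{WriterProperties, WriterVersion};
use sha2::{Digest, Sha256};

fn write_to_parquet(batch: &RecordBatch) -> Vec<u8> {
    // Pin every serialization knob we can: compression, dictionary encoding,
    // row group size, writer version, and even the `created_by` string.
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .set_dictionary_enabled(true)
        .set_max_row_group_size(1024 * 1024)
        .set_writer_version(WriterVersion::PARQUET_1_0)
        .set_created_by("repro-test".to_string())
        .build();

    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), Some(props)).unwrap();
    writer.write(batch).unwrap();
    writer.close().unwrap();
    buf
}

fn main() {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![1i64, 2, 3])),
            Arc::new(StringArray::from(vec!["a", "b", "c"])),
        ],
    )
    .unwrap();

    let h1 = Sha256::digest(write_to_parquet(&batch));
    let h2 = Sha256::digest(write_to_parquet(&batch));
    // Equal hashes here only show determinism within one process on one host;
    // they say nothing about other hosts, builds, or future library versions.
    println!("identical: {}", h1 == h2);
}
```
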

Context

I'm working on a system for fully reproducible and deterministic data processing and computing dataset hashes to assert these guarantees.

My key goal has been to ensure that dataset b contains an identical set of records as dataset b'. This is of course very different from hashing the binary representation of Arrow/Parquet. Not wanting to deal with the reproducibility of storage formats, I've been computing logical data hashes in memory. This is slow but flexible: for example, my hash stays the same even if records are re-ordered (which I consider an equivalent dataset).
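
For illustration only (this is not the exact algorithm I use), the idea of an order-insensitive logical hash can be sketched like this: hash each record from a canonical encoding of its values and combine the per-record digests with an order-independent operation such as byte-wise wrapping addition. A production design would want a stronger order-independent combiner (e.g. a proper multiset hash), but the sketch shows why re-ordering records leaves the hash unchanged.

```rust
use sha2::{Digest, Sha256};

/// Hypothetical record type standing in for one row of the dataset.
#[derive(Clone)]
struct Record {
    id: i64,
    name: String,
}

fn record_digest(r: &Record) -> [u8; 32] {
    let mut hasher = Sha256::new();
    // A canonical, unambiguous encoding of the record's logical values
    // (a real implementation would handle nulls, nested types, etc.).
    hasher.update(r.id.to_le_bytes());
    hasher.update((r.name.len() as u64).to_le_bytes());
    hasher.update(r.name.as_bytes());
    hasher.finalize().into()
}

fn logical_hash(records: &[Record]) -> [u8; 32] {
    // Order-independent combine: byte-wise wrapping addition of per-record digests.
    let mut acc = [0u8; 32];
    for r in records {
        let d = record_digest(r);
        for (a, b) in acc.iter_mut().zip(d.iter()) {
            *a = a.wrapping_add(*b);
        }
    }
    acc
}

fn main() {
    let a = vec![
        Record { id: 1, name: "x".into() },
        Record { id: 2, name: "y".into() },
    ];
    let mut b = a.clone();
    b.reverse();
    // Same logical content in a different order -> same hash.
    assert_eq!(logical_hash(&a), logical_hash(&b));
}
```
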

But when thinking about integrating with IPFS and other content-addressable storage that relies on file hashes, it would simplify the design a lot to have just one hash (physical) instead of two (logical + physical). That, however, means I have to guarantee that Parquet files are reproducible.


Update

I decided to continue using logical hashing for now.

I've created a new Rust crate, arrow-digest, that implements stable hashing for Arrow arrays and record batches and tries hard to hide encoding-related differences. The crate's README describes the hashing algorithm in case someone finds it useful and wants to implement it in another language.

I'll continue to expand the set of supported types as I integrate it into the decentralized data processing tool I'm working on.

In the long term, I'm not sure logical hashing is the best way forward; a subset of Parquet that sacrifices some efficiency to make the file layout deterministic might be a better choice for content-addressability.


Solution

  • At least in Arrow's implementation I would expect, but haven't verified, that the exact same input (including identical metadata), written in the same order and with the same configuration, yields deterministic output (we try not to leave uninitialized values, for security reasons), assuming the chosen compression algorithm also makes that determinism guarantee. It is possible there is some hash-map iteration for metadata, or elsewhere, that might break this assumption.

    As @Pace pointed out, I would not rely on this and recommend against relying on it. There is nothing in the spec that guarantees it, and since the writer version is persisted when writing a file, you are guaranteed a breakage if you ever decide to upgrade (the sketch after this answer shows how to inspect that footer metadata). Things will also break if additional metadata is added or removed (I believe in the past there have been some bug fixes for round-tripping datasets that would have caused non-determinism).

    So, in summary, this might or might not work today, but even if it does, I would expect it to be very brittle.
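
One concrete source of fragility: the Parquet footer stores the format version and a `created_by` string identifying the writer, so upgrading the library changes the bytes even when the data does not. A sketch for inspecting that footer metadata with the Rust parquet crate (accessor names and return types vary slightly between versions):

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
    // Open an existing Parquet file and read only its footer metadata.
    let file = File::open("data.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let file_meta = reader.metadata().file_metadata();

    // These values are baked into every file and change with library upgrades,
    // independently of the logical data being written.
    println!("format version: {}", file_meta.version());
    println!("created_by:     {:?}", file_meta.created_by());
    println!("num rows:       {}", file_meta.num_rows());
}
```
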