Tags: python, pandas, io, parquet, pyarrow

Parquet File re-write has slightly larger size in both Pandas / PyArrow


So I am trying to read a parquet file into memory, choose chunks of the file, and upload them to an AWS S3 bucket. I want to write sanity tests to check whether a file was uploaded correctly, either through a size check or an MD5 hash comparison between the local file and the file on the bucket.
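
For context, the kind of sanity check I have in mind is roughly the following (the paths are placeholders, and the S3 side is left out here):

import hashlib
import os

def md5_of_file(path, chunk_size=1024 * 1024):
    # Stream the file in chunks so it does not have to fit in memory at once.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

# Compare size and checksum between the original and a re-written copy.
print(os.path.getsize('local.parquet'), os.path.getsize('downloaded_copy.parquet'))
print(md5_of_file('local.parquet'), md5_of_file('downloaded_copy.parquet'))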

One thing I noticed is that reading a file into memory, either as bytes or as a pd.DataFrame / Table, and then re-writing the same object to a new file changes the file size, in my case increasing it compared to the original. Here's some sample code:

import pandas as pd
df = pd.read_parquet("data/example.parquet")

Then I simply write:

from io import BytesIO
buffer = BytesIO()
df.to_parquet(buffer)  # this could be done directly without BytesIO; I use it for clarity
with open('copy.parquet', 'wb') as f:
    f.write(buffer.getvalue())

Now running ls -l on both files gives me different sizes:

37089 Oct 28 16:57 data/example.parquet
37108 Dec  7 14:17 copy.parquet

Interestingly enough, I tried using a tool such as xxd paired with diff, and to my surprise the binary difference was scattered all across the file, so I think it's safe to assume that this is not just a metadata difference. Reloading both files into memory using pandas gives me the same table (I've included the equality check below). It might also be worth mentioning that the parquet file contains both NaN and NaT values. Unfortunately I cannot share the file, but I'll see if I can replicate the behavior with a small sample. I also tried using PyArrow's file reading functionality, which resulted in the same file size:

import pyarrow as pa
import pyarrow.parquet as pq
with open('data/example.parquet', 'rb') as f:
    buffer = pa.BufferReader(f.read())
table = pq.read_table(buffer)
pq.write_table(table, 'copy.parquet')
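
For reference, the equality check I mentioned above looks roughly like this:

import pandas as pd
import pyarrow.parquet as pq

# Both files load back to identical data, even though their bytes differ.
pd.testing.assert_frame_equal(
    pd.read_parquet('data/example.parquet'),
    pd.read_parquet('copy.parquet'),
)
print(pq.read_table('data/example.parquet').equals(pq.read_table('copy.parquet')))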

I have also tried explicitly setting compression='snappy' in both versions, but it did not change the output.
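
Concretely, that attempt looked roughly like this (re-reading the data so the snippet stands on its own):

import pandas as pd
import pyarrow.parquet as pq

df = pd.read_parquet('data/example.parquet')
table = pq.read_table('data/example.parquet')

# Explicitly request snappy compression in both writers.
df.to_parquet('copy_pandas.parquet', compression='snappy')
pq.write_table(table, 'copy_arrow.parquet', compression='snappy')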

Is there some configuration I'm missing when writing back to disk?


Solution

  • Pandas uses pyarrow to read/write Parquet, so it is unsurprising that the results are the same. I am not sure what clarity using buffers adds compared to saving the files directly, so I have left them out of the code below.

    What was used to write the example file? If it was not pandas but e.g. pyarrow directly, that would show up as a mostly metadata difference, since pandas adds its own schema in addition to the normal Arrow metadata.

    You say this is not the case here, though, so the likely reason is that the file was written by another system with a different version of pyarrow. As Michael Delgado mentioned in the comments, snappy compression is turned on by default, and snappy is not deterministic between systems:

    not across library versions (and possibly not even across architectures)

    This explains why you see differences all over the file. You can run the code below to see that, on the same machine, the md5 is identical between files, but the pandas version is larger because of the added metadata (you can also verify this metadata directly; see the check after the output below).

    Currently the Arrow S3 writer does not check for integrity, but the S3 API has such functionality. I have opened an issue to make this accessible via Arrow. In the meantime, see the sketch at the end of this answer for doing the check through the S3 API directly.

    import pandas as pd
    import pyarrow as pa
    import numpy as np
    import pyarrow.parquet as pq
    
    arr = pa.array(np.arange(100))
    table = pa.Table.from_arrays([arr], names=["col1"])
    
    pq.write_table(table, "original.parquet")
    
    pd_copy = pd.read_parquet("original.parquet")
    copy = pq.read_table("original.parquet")
    
    pq.write_table(copy, "copy.parquet")
    pd_copy.to_parquet("pd_copy.parquet")
    
    $ md5sum original.parquet copy.parquet pd_copy.parquet                                                                                 
    fb70a5b1ca65923fec01a54f85f17260  original.parquet
    fb70a5b1ca65923fec01a54f85f17260  copy.parquet
    dcb93cb89426a948e885befdbee204ff  pd_copy.parquet
    
    1092 copy.parquet
    1092 original.parquet
    2174 pd_copy.parquet
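
    To confirm that the extra bytes in pd_copy.parquet really are the added pandas schema metadata, you can inspect the file schemas directly (a quick check using the files written above; the exact metadata contents will vary):

    import pyarrow.parquet as pq

    # The pandas-written file carries an extra b'pandas' entry in its schema
    # metadata; the file written by pyarrow alone does not.
    print(pq.read_schema("copy.parquet").metadata)
    print(pq.read_schema("pd_copy.parquet").metadata)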
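
    As for the upload check itself: until integrity checking is exposed through Arrow, one option is to let S3 verify the bytes for you by sending a Content-MD5 header with the upload. Below is a minimal sketch, assuming boto3 and placeholder bucket/key names; S3 recomputes the digest server-side and rejects the request with a BadDigest error if it does not match:

    import base64
    import hashlib

    import boto3

    with open("pd_copy.parquet", "rb") as f:
        body = f.read()

    # Base64-encoded MD5 of the exact bytes being uploaded.
    md5_b64 = base64.b64encode(hashlib.md5(body).digest()).decode()

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="my-bucket",          # placeholder
        Key="data/pd_copy.parquet",  # placeholder
        Body=body,
        ContentMD5=md5_b64,
    )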