
Why is metadata written at the end of the file in Apache Parquet?


I wonder why Apache Parquet writes metadata at the end of the file instead of at the beginning.

In the official Apache Parquet documentation, I found the statement "Metadata is written after the data to allow for single pass writing." Is the metadata written at the end to ensure the integrity of the file? I don't understand what this sentence really means. If someone could explain it to me, I'd appreciate it.


Solution

  • I think the main reason is so you can write larger-than-memory data to a single file.

    The metadata describes the schema of the data (the column types) and its shape (the number of row groups and the size of each row group).

    So to generate the metadata you need to know what the data looks like, which is a problem if the data doesn't fit in memory. Placing the metadata at the end means the writer never has to go back and rewrite the start of the file, which is what the documentation means by single-pass writing.

    In this case, you can still split your data into manageable row groups (that fit in memory) and append them to the file one by one, keeping track of the metadata as you go and appending it at the end.

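    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([pa.field("col1", pa.int32())])

    # Each write appends row groups to the file; only one small table is in memory at a time.
    with pq.ParquetWriter("table.parquet", schema=schema) as writer:
        for i in range(10):
            writer.write(pa.table({"col1": [i] * 10}, schema=schema))
    # Leaving the with block closes the writer, which writes the metadata footer.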
    

    If you're looking for an alternative in which the data can be streamed, with the metadata written at the beginning, have a look at the Arrow IPC format.