parquet, pyarrow, fastparquet

How can one append to parquet files and how does it affect partitioning?


Does parquet allow appending to a parquet file periodically?

How does appending relate to partitioning, if at all? For example, if I identified a column with low cardinality and partitioned the data by that column, and then appended more data, would parquet automatically append the data while preserving the partitioning, or would I have to repartition the file?


Solution

  • Does parquet allow appending to a parquet file periodically?

    Yes and no. The Parquet spec describes a format that can, in principle, be appended to: read the existing footer, write a new row group after the existing data, and then write out a modified footer that also references the new row group. This process is described a bit here.

    Not all implementations support this operation. The only implementation I am aware of at the moment is fastparquet (see this answer). It is usually acceptable, simpler, and potentially faster to cache and batch instead, either by caching in memory or by writing small files and merging them together at some later point.

  • How does appending relate to partitioning, if any?

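    The Parquet file format itself does not have any concept of partitioning.

    Many tools that support parquet implement partitioning on top of it, typically as a directory layout. For example, pyarrow has a datasets feature which supports partitioning. If you were to append new data using this feature, a new file would be created in the appropriate partition directory, so the existing files are left as they are and nothing has to be repartitioned.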