python · pyspark · parquet

Why do Parquet files generate multiple parts in Pyspark?


After some extensive research, I have found that

Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

However, I don't understand why Parquet writes multiple part files when I run df.write.parquet("/tmp/output/my_parquet.parquet"), given that the format already supports flexible compression options and efficient encoding. Is this directly related to parallel processing or a similar concept?
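For reference, here is a minimal reproduction of what I'm seeing; the local SparkSession setup, the partition count, and the directory listing are just my own illustration:

```python
from pyspark.sql import SparkSession
import os

# Local session purely for illustration; a cluster behaves the same way.
spark = (
    SparkSession.builder
    .master("local[4]")
    .appName("parquet-parts")
    .getOrCreate()
)

# A small DataFrame spread across several partitions.
df = spark.range(0, 1_000_000).repartition(4)

# Despite the ".parquet" suffix, the path becomes a *directory* containing
# one part-*.parquet file per partition plus a _SUCCESS marker.
df.write.mode("overwrite").parquet("/tmp/output/my_parquet.parquet")

print(sorted(os.listdir("/tmp/output/my_parquet.parquet")))
# e.g. ['_SUCCESS', 'part-00000-...snappy.parquet', 'part-00001-...', ...]
```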


Solution

  • Many frameworks make use of this multi-file layout of the Parquet format, so I'd say it's a standard option that is part of the Parquet specification, and Spark uses it by default.

    This does have benefits for parallel processing, but also for other use cases, such as processing (in parallel or in series) on cloud or networked file systems, where data-transfer time can be a significant portion of total IO. In these cases the Parquet "hive" layout, which uses small metadata files providing statistics and indicating which data files to read, offers significant performance benefits when reading a small subset of the data. This is true whether a single-threaded application is reading a subset of the data or each worker in a parallel job is reading a portion of the whole, as sketched below.
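    Here is a rough sketch of both points in PySpark; the column name, paths, and partition counts are illustrative assumptions, not anything from the question:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master("local[4]")
        .appName("parquet-layout")
        .getOrCreate()
    )

    # Toy data with a column we can partition the output by.
    df = (
        spark.range(0, 1_000_000)
        .withColumn("year", (F.col("id") % 3 + 2021).cast("int"))
    )

    # 1) One part file per DataFrame partition: coalescing to a single
    #    partition yields a single file, at the cost of write parallelism.
    df.coalesce(1).write.mode("overwrite").parquet("/tmp/output/single_file.parquet")

    # 2) Hive-style partitioned layout: directories such as year=2021/,
    #    year=2022/, each holding its own part files with footer statistics.
    df.write.mode("overwrite").partitionBy("year").parquet("/tmp/output/by_year.parquet")

    # A filter on the partition column only touches the matching directory,
    # and min/max statistics in the Parquet footers let readers skip row groups.
    subset = spark.read.parquet("/tmp/output/by_year.parquet").where(F.col("year") == 2022)
    subset.explain()  # the physical plan lists a PartitionFilter on year
    ```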