apache-spark, hadoop, pyspark, parquet

Pyspark: Save dataframe to multiple parquet files with specific size of single file


How can I save a PySpark dataframe to multiple Parquet files, each with a specific size?

Example: My dataframe uses 500 GB on HDFS and each file is 128 MB. I want to save it as 250 Parquet files of 2 GB each. How can I achieve this?


Solution

  • It's always worth doing a quick search to see whether this has already been asked and answered here; I can see a couple of related questions:

    PySpark: How to specify file size when writing parquet files?

    Write pyspark dataframe into specific number of parquet files in total across all partition columns

    To save a PySpark dataframe to multiple Parquet files with a specific size, you can use the repartition method to split the dataframe into the desired number of partitions, then use the write method (optionally with partitionBy) to save each partition as a separate Parquet file. For example, to save a dataframe as 250 Parquet files of roughly 2 GB each, you can use the following code:

    df = df.repartition(250)  # shuffle into 250 partitions -> 250 output files
    df.write.partitionBy("partition_column").parquet("hdfs:///path//")
    

    Replace partition_column with the name of the column you want to partition by; this organizes the output files into subdirectories per value of that column. The partitionBy call in this statement is optional.
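
    If you would rather derive the partition count from a target file size than hard-code 250, the arithmetic is just total size divided by target size. A minimal sketch, assuming the 500 GB figure from the question and hypothetical input/output paths (path_in, path_out):

    import math
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input path; replace with your dataset's location
    df = spark.read.parquet("hdfs:///path_in/")

    # Figures from the question: ~500 GB of data, 2 GB per output file
    total_size_bytes = 500 * 1024**3
    target_file_bytes = 2 * 1024**3

    # Round up so every output file stays at or below the target size
    num_files = math.ceil(total_size_bytes / target_file_bytes)  # 250 here

    df.repartition(num_files).write.parquet("hdfs:///path_out/")

    The sizes are approximate: repartition spreads rows evenly across partitions, but compression and data skew mean the output files will not be exactly 2 GB each.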

    If you are reducing the number of partitions from something higher than 250, you can use coalesce to avoid a full shuffle, but repartition is better for ensuring evenly sized output; see the sketch below.
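
    For completeness, a minimal sketch of the coalesce variant (paths are placeholders). coalesce only merges existing partitions, so it avoids a full shuffle but can only reduce the partition count, and the resulting files may be uneven in size:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input path; with the question's numbers the dataframe
    # starts with roughly 4,000 partitions (500 GB / 128 MB)
    df = spark.read.parquet("hdfs:///path_in/")

    # No shuffle: existing partitions are merged down to 250, so output
    # file sizes depend on how the original partitions happen to combine
    df.coalesce(250).write.parquet("hdfs:///path_out/")

    If evenly sized files matter more than avoiding the shuffle, prefer df.repartition(250) as shown earlier.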