Tags: apache-spark, apache-spark-sql, parquet, google-cloud-dataproc

Output Parquet file is very big in size after repartitioning with column in Spark


The dataframe that I am repartitioning by a column is producing a single output file of more than 500 MB.

df.repartition(col("column_name")).write.parquet("gs://path_of_bucket")

Is there a way to limit the size of each output Parquet file to 128 MB? I don't want to hard-code a number of partitions because the output volume varies hourly. I am using a Dataproc cluster and the output goes to a GCS bucket.


Solution

  • You can use spark.sql.files.maxRecordsPerFile to split the dataframe being written into files of at most X rows each.

    Property Name: spark.sql.files.maxRecordsPerFile
    Default: 0
    Meaning: Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit.
    Since Version: 2.2.0

    If your rows are more or less uniform in size, you can estimate the row count X that yields your desired file size (128 MB), as in the sketch below.
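
    For example, a minimal Scala sketch, assuming roughly 1 KB per row on average (measure your own average row size from a sample and adjust) and reusing `spark` and `df` from the question:

    // Assumption: ~1 KB per row; replace with your measured average row size.
    import org.apache.spark.sql.functions.col

    val approxBytesPerRow = 1024L                  // estimated average row size in bytes
    val targetFileBytes   = 128L * 1024 * 1024     // desired file size: 128 MB
    val maxRecords        = targetFileBytes / approxBytesPerRow

    // Cap the number of records written to any single output file.
    spark.conf.set("spark.sql.files.maxRecordsPerFile", maxRecords)

    df.repartition(col("column_name"))
      .write
      .parquet("gs://path_of_bucket")

    The same limit can also be applied to a single write by passing .option("maxRecordsPerFile", maxRecords) on the DataFrameWriter instead of setting the session-wide config.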