Tags: scala, apache-spark, amazon-s3, apache-spark-sql, parquet

Should we avoid partitionBy when writing files to S3 in Spark?


The parquet location is:

s3://mybucket/ref_id/date/camera_id/parquet-file

Let's say I have 3 ref_id values, 4 dates, and 500 camera_ids. If I write Parquet as below (using partitionBy), I get 3 x 4 x 500 = 6000 files uploaded to S3. This is extremely slow compared to writing just a couple of files to the top-level bucket (no multi-level prefix).

What is the best practice? My colleague argues that partitionBy is a good thing when used together with a Hive metastore/table.

df.write.mode("overwrite")\
  .partitionBy('ref_id','date','camera_id')\
  .parquet('s3a://mybucket/tmp/test_data')

Solution

  • If your problem is too many files, which seems to be the case here, you need to repartition your RDD/DataFrame before you write it. Each RDD/DataFrame partition will generate one file per partition folder.

    df.repartition(1)\
     .write.mode("overwrite")\
     .partitionBy('ref_id','date','camera_id')\
     .parquet('s3a://mybucket/tmp/test_data')
    

    As an alternative to repartition you can also use coalesce.
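
    A minimal sketch of the coalesce variant, using the same hypothetical DataFrame and S3 path as above; coalesce(1) merges the existing partitions without triggering a full shuffle:

    # Same write as before, but coalesce(1) narrows the existing partitions
    # into one instead of performing a full repartition shuffle.
    df.coalesce(1)\
      .write.mode("overwrite")\
      .partitionBy('ref_id', 'date', 'camera_id')\
      .parquet('s3a://mybucket/tmp/test_data')

    Note that coalesce avoids the shuffle but can also reduce the parallelism of upstream stages, since the narrower partitioning is pushed up the plan; repartition keeps the upstream parallelism at the cost of a shuffle.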


    If (after repartitioning to 1) the files are too small, you need to reduce the directory structure (see the sketch at the end of this answer). The Parquet documentation recommends a file size between 512 MB and 1 GB.

    https://parquet.apache.org/documentation/latest/

    We recommend large row groups (512MB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block.

    If your files are only a few KB or MB, you have a real problem: it will seriously hurt performance.
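
    One way to reduce the directory structure, sketched here as an assumption rather than the only option, is to drop the finest-grained column (camera_id) from partitionBy and keep it as an ordinary column inside the Parquet files. Repartitioning by the remaining partition columns then puts all rows for a given ref_id/date into one task, so each folder receives a single, larger file:

    # Hypothetical sketch: partition the path only by ref_id and date, and
    # store camera_id as a normal column inside the files. Hash-partitioning
    # by the same columns sends each ref_id/date combination to one task,
    # so each of the 3 x 4 = 12 folders gets one larger file instead of 500 tiny ones.
    df.repartition('ref_id', 'date')\
      .write.mode("overwrite")\
      .partitionBy('ref_id', 'date')\
      .parquet('s3a://mybucket/tmp/test_data')

    Queries that filter on camera_id then rely on Parquet's column statistics and predicate pushdown instead of directory pruning, which is usually a reasonable trade when the per-camera files would otherwise be only a few MB each.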