The parquet location is:
s3://mybucket/ref_id/date/camera_id/parquet-file
Let's say I have ref_id x 3, date x 4, and camera_id x 500. If I write parquet as below (using partitionBy), I get 3 x 4 x 500 = 6,000 files uploaded to S3, which is extremely slow compared to just writing a couple of files to the top-level bucket (no multi-level prefix).
What is the best practice here? My colleague argues that partitionBy is a good thing when used together with a Hive metastore/table; I've added a sketch of what I think he means after the snippet.
df.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')
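If I understand him correctly, he means registering the partitioned output as a table so the metastore tracks the partitions and can prune them at query time, roughly like this (the database/table name is made up):
df.write.mode("overwrite")\
  .format("parquet")\
  .partitionBy('ref_id','date','camera_id')\
  .saveAsTable("my_db.camera_events")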
If your problem is too many files, which seems to be the case, you need to repartition your RDD/DataFrame before you write it. Each RDD/DataFrame partition generates one file in every partition folder it holds data for.
df.repartition(1)\
.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')
As an alternative to repartition you can also use coalesce.
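For example, the same write with coalesce; it avoids a full shuffle when you are only reducing the number of partitions:
df.coalesce(1)\
  .write.mode("overwrite")\
  .partitionBy('ref_id','date','camera_id')\
  .parquet('s3a://mybucket/tmp/test_data')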
If (after repartitioning to 1) the files are too small, you need to reduce the directory structure. The Parquet documentation recommends a file size between 512 MB and 1 GB:
https://parquet.apache.org/documentation/latest/
We recommend large row groups (512MB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block.
If your files are only a few KB or MB in size, you have a serious problem: it will badly hurt performance.
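On the Spark side, the row-group size quoted above is controlled by the parquet.block.size Hadoop property. A minimal sketch of setting it when building the session (the 512 MB value simply mirrors the recommendation above; tune it for your data):
from pyspark.sql import SparkSession

# spark.hadoop.* entries are forwarded to the underlying Hadoop configuration,
# so this sets parquet.block.size (the Parquet row-group size) to 512 MB.
spark = SparkSession.builder \
    .appName("parquet-write") \
    .config("spark.hadoop.parquet.block.size", str(512 * 1024 * 1024)) \
    .getOrCreate()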