Tags: amazon-s3, pyspark, amazon-emr, parquet

Spark Parquet write job completes, but there is a long delay before the next job starts


I am running Spark 2.4.4 on AWS EMR and experiencing a long delay after Spark writes a Parquet file to S3. I checked that the S3 write itself completes in a few seconds (the data files and the _SUCCESS file appear in S3), but there is still a delay of around 5 minutes before the following jobs start.

I saw someone refer to this as the "Parquet tax". I have tried the fixes proposed in those articles but still cannot resolve the issue. Can anyone give me a hand? Thanks so much.


Solution

  • You can start by setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version to 2.

    You can set this config using any of the following methods:

    • When you launch your cluster, put spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 in the Spark config.
    • At runtime, call spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2").
    • When you write data using the Dataset API, set it as a write option, i.e. dataset.write.option("mapreduce.fileoutputcommitter.algorithm.version", "2").
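The three methods above could be sketched in PySpark roughly as follows (a minimal sketch, not tested against your cluster; the app name, DataFrame, and S3 path are hypothetical placeholders):

```python
from pyspark.sql import SparkSession

# 1) At session launch (equivalent to putting the setting in the
#    cluster's Spark config when you launch it).
spark = (
    SparkSession.builder
    .appName("parquet-commit-v2")  # hypothetical app name
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

# 2) At runtime, on an already-running session.
spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

# 3) Per-write, via a DataFrameWriter option.
df = spark.range(10)  # placeholder data standing in for your dataset
(df.write
   .option("mapreduce.fileoutputcommitter.algorithm.version", "2")
   .mode("overwrite")
   .parquet("s3://your-bucket/path/"))  # hypothetical bucket/path
```

Algorithm version 2 moves task output directly into the destination directory during task commit instead of renaming it again at job commit, which is what shortens the post-write pause on S3.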