I am running Spark 2.4.4 on AWS EMR and I'm seeing a long delay after Spark writes Parquet files to S3. The S3 write itself completes in a few seconds (the data files and the `_SUCCESS` file appear in S3), but there is still a delay of around 5 minutes before the following jobs start.
I saw someone refer to this as the "Parquet Tax". I have tried the fixes proposed in those articles but still cannot resolve the issue. Can anyone give me a hand? Thanks so much.
You can start by setting `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` to 2. Version 2 of the output committer renames task output files directly into the destination directory as each task commits, instead of doing a second round of renames at job commit, which is where much of the post-write delay on S3 comes from.
You can set this config by using any of the following methods:
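For example, any of the following sets the committer algorithm to version 2 (the `spark-submit` arguments and application names here are illustrative placeholders):

```
# 1) In spark-defaults.conf (on EMR, via a spark-defaults classification)
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2

# 2) On the spark-submit command line
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  my_app.py

# 3) In code, when building the SparkSession (PySpark shown)
spark = SparkSession.builder \
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2") \
    .getOrCreate()
```

Note that this must be set before the `SparkSession` is created; changing it on an already-running session has no effect on the Hadoop output committer.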