Tags: amazon-s3, pyspark, amazon-emr, parquet

Spark Parquet write job completes, but there is a long delay before the next job starts


I am running Spark 2.4.4 on AWS EMR and experiencing a long delay after Spark writes a Parquet file to S3. I checked that the S3 write itself completes in a few seconds (the data files and the _SUCCESS file appear in S3), but there is still a delay of around 5 minutes before the following jobs start.

I saw someone refer to this as the "Parquet tax". I have tried the fixes proposed in those articles but still cannot resolve the issue. Can anyone give me a hand? Thanks so much.


Solution

  • You can start by setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version to 2.

    You can set this config using any of the following methods:

    • When you launch your cluster, put spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 in the Spark config.
    • At runtime, call spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2").
    • When you write data using the Dataset API, set it as a write option, i.e. dataset.write.option("mapreduce.fileoutputcommitter.algorithm.version", "2").
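The three methods above could be sketched in PySpark roughly as follows (a minimal sketch, not tested against your cluster; the app name, DataFrame, and S3 path are hypothetical placeholders):

```python
from pyspark.sql import SparkSession

# 1) At session launch (equivalent to putting the setting in the
#    cluster's Spark config when you launch it).
spark = (
    SparkSession.builder
    .appName("parquet-commit-v2")  # hypothetical app name
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

# 2) At runtime, on an already-running session.
spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

# 3) Per-write, via a DataFrameWriter option.
df = spark.range(10)  # placeholder data standing in for your dataset
(df.write
   .option("mapreduce.fileoutputcommitter.algorithm.version", "2")
   .mode("overwrite")
   .parquet("s3://your-bucket/path/"))  # hypothetical bucket/path
```

Algorithm version 2 moves task output directly into the destination directory during task commit instead of renaming it again at job commit, which is what shortens the post-write pause on S3.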