apache-spark, amazon-s3, emr, orc

Spark SaveMode.Append fails with "File already exists" on S3


We are experiencing rare failures when writing to S3 from Spark jobs on Amazon EMR (5.13). Here is the relevant part of the log:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 
...
Caused by: java.io.IOException: File already exists:s3://*****/part-00003-58fe4151-60d6-4605-b971-21dbda31678b-c000.snappy.orc
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:507)
...

It looks very strange because we use SaveMode.Append to save the dataset:

input.write().mode(SaveMode.Append).orc(path);

I googled a bit and found a couple of reports of the same issue (look here), but we don't use spark.speculation, so I have no idea what happened.
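
For reference, speculative execution is off by default (spark.speculation defaults to false), but it can be pinned down explicitly when building the session. A minimal sketch in Java; the class and app names are made up:

import org.apache.spark.sql.SparkSession;

public class NoSpeculation {
    public static void main(String[] args) {
        // Set spark.speculation explicitly to rule out speculative task
        // attempts racing to create the same part file in the output path
        SparkSession spark = SparkSession.builder()
                .appName("orc-append-job")  // hypothetical app name
                .config("spark.speculation", "false")
                .getOrCreate();
    }
}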

Can anybody suggest where I should look for the root of this problem?


Solution

  • The EMR code is closed-source, so I cannot comment on its internals. I do know that without a consistency layer, committing work to S3 is prone to rare failures: either visible ones, as here, or silent data loss, where the fake directory rename misses newly created files when it lists everything under a path.

    Try writing the output to local HDFS first and copying it to S3 afterwards (a sketch follows below), or enable EMRFS consistent view.
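
    A rough sketch of the HDFS staging approach; the class name, source path, and staging path below are all hypothetical, and the read stands in for however the question's input dataset is produced:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class StageOnHdfs {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().getOrCreate();

            // Hypothetical source; stands in for the question's `input` dataset
            Dataset<Row> input = spark.read().orc("hdfs:///tmp/input");

            // Stage the output on HDFS, where the commit-time rename is an
            // atomic metadata operation, instead of writing straight to S3
            input.write().mode(SaveMode.Append).orc("hdfs:///tmp/staging/my-dataset");
        }
    }

    The staged files can then be shipped to S3 in a separate step, for example with EMR's s3-dist-cp tool: s3-dist-cp --src hdfs:///tmp/staging/my-dataset --dest s3://my-bucket/my-dataset (the bucket name is a placeholder).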