Tags: azure, apache-spark, apache-spark-sql, azure-storage

How to Spark batch job global commit on ADLS Gen2?


I have a Spark batch application writing to ADLS Gen2 (hierarchical namespace).

When designing the application I assumed Spark would perform a global commit once the job completed, but what it actually does is commit per task: as soon as a task finishes writing, its output is moved from the temp location to the target storage.

So if the batch fails we end up with partial data, and on retry we get data duplication. Our scale is really huge, so rolling back (deleting the data) is not an option for us; finding the affected files would take a lot of time.

Is there any "built-in" solution, something we can use out of the box?

Right now we are considering writing to some temp destination and moving the files only after the whole job has completed, but we would like to find a more elegant solution (if one exists).


Solution

  • This is a known issue. Apache Iceberg, Hudi, and Delta Lake are among the possible solutions. Alternatively, instead of writing the output directly to the "official" location, write it to a staging directory, and once the job is done, rename the staging dir to the official location, as sketched below.
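
A minimal PySpark sketch of the staging-directory approach, assuming a hierarchical-namespace account where a directory rename is a single metadata operation. The account, container, and path names are placeholders, not the asker's real ones.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("staged-batch-write").getOrCreate()

# Placeholder paths for illustration only.
staging_path = "abfss://data@myaccount.dfs.core.windows.net/output/_staging/run_20240101"
final_path = "abfss://data@myaccount.dfs.core.windows.net/output/run_20240101"

df = spark.read.parquet("abfss://data@myaccount.dfs.core.windows.net/input/")

# 1. Write the full result to the staging directory; per-task commits only
#    ever touch this directory, so readers of final_path never see partial data.
df.write.mode("overwrite").parquet(staging_path)

# 2. Only after the whole job has succeeded, rename the staging directory to the
#    final location using the Hadoop FileSystem API. On ADLS Gen2 with a
#    hierarchical namespace this rename is a single metadata operation.
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark.sparkContext._jvm.org.apache.hadoop.fs.Path
fs = Path(final_path).getFileSystem(hadoop_conf)
if not fs.rename(Path(staging_path), Path(final_path)):
    raise RuntimeError("Rename from staging to final location failed")
```

If you adopt one of the table formats instead (e.g. Delta Lake), writing with `df.write.format("delta").save(path)` gives similar all-or-nothing semantics through its transaction log, so a failed or retried job does not expose partial or duplicated data to readers.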