
Databricks Checksum error while writing to a file


I am running a job on 9 nodes.

Each of them writes some information to files with simple appends like the one below:

dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)

However I am receiving this exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o106.save. : java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error: file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0 exp: 1179219224 got: -1020415797

It looks to me that, because of concurrency, Spark is somehow failing and generating checksum errors.

Is there any known scenario that could be causing this?


Solution

  • There are a couple of things going on here, and together they explain why coalesce may not work.

    1. What coalesce does is essentially combine partitions without a full shuffle. For example, if you have three workers, you can perform coalesce(3), which consolidates the partitions on each worker locally.

    2. What repartition does is shuffle the data to increase or decrease the total number of partitions. In your case, since you have more than one worker and you need a single output, you have to use repartition(1) so that all the data lands on a single partition before it is written out.

    Why would coalesce not work? Spark limits shuffling during coalesce: you cannot perform a full shuffle (across different workers) with coalesce, whereas repartition does perform a full shuffle, although that is an expensive operation.

    Here is the code that would work:

    dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
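To make the distinction concrete, here is a minimal pure-Python sketch (not Spark itself; the helper names and the round-robin assignment are illustrative assumptions) modeling the difference in data movement: coalesce moves whole source partitions into target partitions, while repartition reassigns every individual row.

```python
# Illustrative model only: real Spark also tracks worker locality and
# uses hash/range partitioners, but the movement pattern is the point.

def model_coalesce(partitions, n):
    """coalesce-style merge: each source partition is appended whole
    to a target partition; no row-level (full) shuffle happens."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)  # whole partitions move together
    return out

def model_repartition(partitions, n):
    """repartition-style full shuffle: every row is individually
    reassigned to a target partition, regardless of its origin."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)   # row-level redistribution
    return out

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(model_coalesce(parts, 2))     # → [[1, 2, 5, 6], [3, 4, 7, 8]]
print(model_repartition(parts, 1))  # → [[1, 2, 3, 4, 5, 6, 7, 8]]
```

With repartition(1), all rows are shuffled onto a single partition, so exactly one task writes the Delta log entry, which avoids the concurrent single-file writes that can trigger the checksum error.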