Tags: apache-spark, pyspark, delta-lake

Checksum error while writing data to a Delta table. Is there a way to fix this issue?


When trying to insert data into a Delta table, I get the checksum-mismatch error below. Is there a way to fix this issue?

Spark version: 3.4 (open source), Delta Lake: 2.4

  • Total records in the table: 15602
  • Total Delta table versions: 15602
  • Size of the Delta table: 1.4 GB
  • Partitioned folders: 488
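For reference, the failing operation is an append of roughly this shape (a minimal sketch: the table path comes from the error below, while the session config and the sample DataFrame are illustrative assumptions, not the actual job):

from pyspark.sql import SparkSession

# Spark 3.4 with open-source Delta Lake 2.4 on the classpath
# (e.g. via --packages io.delta:delta-core_2.12:2.4.0).
spark = (
    SparkSession.builder
    .appName("delta-append")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Illustrative data; the real job inserts into an existing table.
df = spark.createDataFrame([(1, "a")], ["id", "value"])

# The append has to read the _delta_log first, which is where the
# ChecksumException below surfaces.
(df.write
   .format("delta")
   .mode("append")
   .save("/home/kotesh/delete_data_compac/my_table"))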

24/01/27 13:42:49 ERROR Executor: Exception in task 1.0 in stage 688.0 (TID 819)
org.apache.spark.SparkException: Encountered error while reading file file:///home/kotesh/delete_data_compac/my_table/_delta_log/00000000000000015603.json. Details:
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:877)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:307)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/kotesh/delete_data_compac/my_table/_delta_log/00000000000000015603.json at 0 exp: -70964045 got: 69110470
    at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:347)

Solution

  • The table got corrupted by one of the many transaction log files written to the _delta_log folder. After removing the corrupted transaction log file (00000000000000015603.json in the trace above), the Delta table was accessible again; a cleanup sketch follows.
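A minimal sketch of the cleanup step, assuming the corrupted entry is the most recent commit (00000000000000015603.json, per the error message). The .crc sidecar naming is an assumption based on how Hadoop's local ChecksumFileSystem stores checksums; adapt the paths to your table:

import os

# Transaction log directory, taken from the error message above.
delta_log = "/home/kotesh/delete_data_compac/my_table/_delta_log"

# The commit file that fails checksum verification.
corrupted = "00000000000000015603.json"

# Remove the corrupted commit file and its Hadoop checksum sidecar
# (assumed to be named ".<file>.crc"), so FSInputChecker no longer
# compares the file against a stale checksum.
for name in (corrupted, "." + corrupted + ".crc"):
    path = os.path.join(delta_log, name)
    if os.path.exists(path):
        os.remove(path)

Note that this only rolls back the single failed commit, so it is safe only for the latest log entry; deleting an intermediate transaction log file would break the table's version history.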