azure-databricks

You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved


I am using Databricks as a service on Azure. This is my cluster configuration:

(screenshot of cluster configuration)

I ran the command below and everything was OK.

    %sql
    SELECT *
    FROM db_xxxxx.t_fxxxxxxxxx
    LIMIT 10

Then I updated some rows in the table above. When I ran the same command again, I got this error:

    Error in SQL statement: SparkException: Job aborted due to stage failure: Task 3 in stage 2823.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2823.0 (TID 158824, 10.11.49.6, executor 14): com.databricks.sql.io.FileReadException: Error while reading file abfss:REDACTED_LOCAL_PART@storxfadev0501.dfs.core.windows.net/xsi-ed-faits/t_fait_xxxxxxxxxxx/_delta_log/00000000000000000022.json. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:286)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:251)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:205)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:354)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:205)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
        at org.apache.spark.scheduler.Task.run(Task.scala:112)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: java.io.FileNotFoundException: HEAD https://storxfadev0501.dfs.core.windows.net/devdledxsi01/xsi-ed-faits/t_fait_photo_impact/_delta_log/00000000000000000022.json?timeout=90
    StatusCode=404
    StatusDescription=The specified path does not exist.
    ErrorCode=
    ErrorMessage=
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:912)
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.open(AzureBlobFileSystem.java:169)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
        at com.databricks.spark.metrics.FileSystemWithMetrics.open(FileSystemWithMetrics.scala:282)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:85)
        at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)
        at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.readFile(JsonDataSource.scala:134)
        at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$buildReader$2.apply(JsonFileFormat.scala:138)
        at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$buildReader$2.apply(JsonFileFormat.scala:136)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:134)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:235)
        ... 26 more
    Caused by: HEAD https://storxfadev0501.dfs.core.windows.net/devdledxsi01/xsi-ed-faits/t_fait_photo_impact/_delta_log/00000000000000000022.json?timeout=90
    StatusCode=404
    StatusDescription=The specified path does not exist.
    ErrorCode=
    ErrorMessage=
        at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:134)
        at shaded.databricks.v20180920_b

Solution

  • In summary, you can either refresh the table (before executing the query) or restart the cluster:

    spark.sql("REFRESH TABLE schema.table")
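
    For example, a minimal sketch of running the refresh right before re-executing the query in a Python cell (the table name here is the one from the question and is an assumption; substitute your own schema and table):

        # Refresh cached metadata before querying, so Spark re-lists the
        # table's files instead of reusing stale cached references.
        # NOTE: db_xxxxx.t_fxxxxxxxxx is the (masked) table from the question.
        spark.sql("REFRESH TABLE db_xxxxx.t_fxxxxxxxxx")
        display(spark.sql("SELECT * FROM db_xxxxx.t_fxxxxxxxxx LIMIT 10"))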
    

    As the error message says, it is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running the 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If the Delta cache is stale or the underlying files have been removed, you can invalidate it manually by restarting the cluster.
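
    And a sketch of the other option the message mentions, recreating the Dataset/DataFrame (again assuming the table name from the question):

        # Building a fresh DataFrame forces Spark to plan the read again
        # from the current Delta transaction log rather than a cached one.
        df = spark.table("db_xxxxx.t_fxxxxxxxxx")
        df.limit(10).show()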