apache-spark, azure-databricks, delta-lake

Manually deleted data file from Delta Lake


I have manually deleted a data file from a Delta Lake table, and now the command below gives an error:

mydf = spark.read.format('delta').load('/mnt/path/data')
display(mydf)

Error

A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions

I have tried restarting the cluster with no luck, and have also tried the settings below:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.databricks.io.cache.enabled", "false")

Any help on repairing the transaction log or otherwise fixing the error would be appreciated.


Solution

  • As explained before, you must use VACUUM to remove files; manually deleting files does not update the Delta transaction log, which is what Spark uses to identify which files to read.

    In your case you can also use the FSCK REPAIR TABLE command (see the sketch after this answer). As per the docs: "Removes the file entries from the transaction log of a Delta table that can no longer be found in the underlying file system. This can happen when these files have been manually deleted."
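
A minimal sketch of the repair, assuming you are on Databricks (where FSCK REPAIR TABLE is available) and that the table lives at the path from the question. The path-based delta.`...` form is an assumption carried over from other Delta commands; if the table is registered in the metastore, use its registered name instead:

# Preview which missing file entries would be removed (DRY RUN makes no changes)
spark.sql("FSCK REPAIR TABLE delta.`/mnt/path/data` DRY RUN").show(truncate=False)

# Remove the entries for the manually deleted files from the transaction log
spark.sql("FSCK REPAIR TABLE delta.`/mnt/path/data`")

# The table should now load again; note the deleted data itself is not recovered
mydf = spark.read.format('delta').load('/mnt/path/data')
display(mydf)

Note that this only makes the table readable again by dropping the dangling log entries; the data in the deleted file is gone. Going forward, remove data with the table's DELETE statement and clean up old files with VACUUM rather than deleting files directly from storage.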