The documentation says that file event notification systems do not guarantee 100% delivery of all files, and it recommends using backfills to guarantee that all files eventually get processed.
But it's not clear how to use it or where it goes in the code. Should it be part of spark.readStream or writeStream?
If there is more documentation about this, I would appreciate a pointer to it.
You can set the cloudFiles.backfillInterval option on the reader (spark.readStream), like this:
# autoloader_config is a dict of Auto Loader options defined elsewhere
df = spark.readStream.format("cloudFiles") \
    .options(**autoloader_config) \
    .option("cloudFiles.backfillInterval", "1 day") \
    .load("/mnt/data_path/")
According to the documentation, Auto Loader checks for unprocessed files asynchronously and processes any it finds.
By setting the interval, you control how frequently that check runs.
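If you already keep your Auto Loader settings in a dict, one option is to merge the backfill interval into that dict and pass everything through a single .options(**...) call. The sketch below is only an illustration: with_backfill is a hypothetical helper, and the two base options are placeholder values, not settings taken from the question.

```python
def with_backfill(base_options, interval="1 day"):
    """Return a copy of base_options with cloudFiles.backfillInterval set.

    Hypothetical helper: it only builds the options dict and does not touch Spark.
    """
    options = dict(base_options)  # copy so the caller's dict is left unchanged
    options["cloudFiles.backfillInterval"] = interval
    return options


# Placeholder base config (assumption); replace with your own autoloader_config.
autoloader_config = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",
}

reader_options = with_backfill(autoloader_config, "1 day")

# On a Databricks cluster the merged dict would go on the reader, for example:
# df = (spark.readStream.format("cloudFiles")
#       .options(**reader_options)
#       .load("/mnt/data_path/"))
```

Either way, the setting belongs to the read side of the stream; nothing needs to be added to writeStream for it.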
Output:
You can see this in the checkpoint location:
%fs head dbfs:/checkpointLocation2/offsets/0
Here, the trigger occurs based on lastBackfillStartTimeMs and lastBackfillFinishTimeMs.
You can also observe that there are 5 files inside offsets, meaning it checked for old files to process 5 times. This is with the interval set to 30 seconds; with 1 day, it will trigger once a day.
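If you want to read lastBackfillStartTimeMs and lastBackfillFinishTimeMs programmatically instead of eyeballing the %fs head output, something like the following may help. It is a rough sketch that assumes the offset file is a version header followed by one JSON object per line; that layout is an assumption and may differ between runtime versions. The sample text is made up for illustration and is not real output.

```python
import json


def backfill_times(offset_text):
    """Return (start_ms, finish_ms) from the first JSON line holding both keys, else None."""
    for line in offset_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip the version header and any non-JSON lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that only look like JSON
        if "lastBackfillStartTimeMs" in record and "lastBackfillFinishTimeMs" in record:
            return record["lastBackfillStartTimeMs"], record["lastBackfillFinishTimeMs"]
    return None


# Hypothetical sample (assumption): invented values, only the two backfill keys
# come from the answer above.
sample = """v1
{"batchWatermarkMs":0,"batchTimestampMs":1700000030000}
{"seqNum":5,"lastBackfillStartTimeMs":1700000000000,"lastBackfillFinishTimeMs":1700000004000}
"""

start_ms, finish_ms = backfill_times(sample)
duration_s = (finish_ms - start_ms) / 1000  # how long the backfill ran, in seconds
```

Comparing the start times across successive offset files should show roughly the interval you configured.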