pyspark, azure-databricks, databricks-autoloader

Autoloader - file notification and backfillInterval


The documentation says that file event notification systems do not guarantee 100% delivery of all files, and it recommends using backfills to guarantee that all files eventually get processed.

But it's not clear how to use it or where it goes in the code. Should it be part of spark.readStream or writeStream?

And if there is more documentation about it, I would appreciate that.


Solution

  • You can set the cloudFiles.backfillInterval option on the reader (spark.readStream) like this:

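    # autoloader_config: dict with the other Auto Loader options, defined elsewhere
    # Note: a single key/value pair is set with .option(), not .options()
    df = spark.readStream.format("cloudFiles") \
        .options(**autoloader_config) \
        .option("cloudFiles.backfillInterval", "1 day") \
        .load("/mnt/data_path/")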
    

    According to the documentation, Auto Loader checks for unprocessed files asynchronously at this interval and processes them.

    By setting the interval, you can control how frequently the system checks for unprocessed files.

    Output:

    You can see this in the checkpoint location:

    %fs head dbfs:/checkpointLocation2/offsets/0

    Here, the backfill is triggered based on lastBackfillStartTimeMs and lastBackfillFinishTimeMs.

    You can also see that there are 5 files inside offsets, meaning it checked for old files to process 5 times. This was with the interval set to 30 seconds; with 1 day, it will trigger once a day.
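To make the readStream vs. writeStream question concrete, here is a minimal sketch of a full pipeline. It assumes a JSON source, a Delta sink, and file notification mode; the helper names (autoloader_options, start_stream) and all paths are hypothetical and not from the original post. The point is only that cloudFiles.backfillInterval is a source option, so it goes on the reader, while the checkpoint location goes on the writer.

```python
def autoloader_options(file_format, schema_location, backfill_interval):
    """Build Auto Loader *reader* options (hypothetical helper, illustrative values)."""
    return {
        "cloudFiles.format": file_format,
        # Needed for schema inference when no explicit schema is passed.
        "cloudFiles.schemaLocation": schema_location,
        # Assumption: the stream runs in file notification mode.
        "cloudFiles.useNotifications": "true",
        # Backfill interval is a cloudFiles source option, so it belongs here.
        "cloudFiles.backfillInterval": backfill_interval,
    }


def start_stream(spark, source_path, checkpoint_path, target_path):
    """Wire an Auto Loader source to a Delta sink (sketch; paths are placeholders)."""
    reader_options = autoloader_options("json", checkpoint_path, "1 day")
    df = (
        spark.readStream.format("cloudFiles")
        .options(**reader_options)  # all cloudFiles.* options go on readStream
        .load(source_path)
    )
    return (
        df.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)  # writer-side setting
        .start(target_path)
    )
```

In a Databricks notebook, where spark is already defined, this could be called as start_stream(spark, "/mnt/data_path/", "dbfs:/checkpointLocation2", "/mnt/target_path/"); the target path here is a placeholder.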