azure-data-factory, azure-databricks

How to execute a Databricks notebook when multiple files have been loaded to ADLS


I'm looking for a lightweight way of executing a Databricks notebook that depends on multiple files having been loaded to Azure Data Lake Storage.

Multiple different ADF packages load different files into ADLS, which are then processed by Databricks notebooks. Some of the notebooks depend on multiple files from different packages.

A single file is simple enough with an event trigger. Can this be generalised to more than one file without something like Airflow handling dependencies?


Solution

  • This isn't exactly light since you'll have to provision an Azure SQL table, but this is what I'd do:

    1. I would create and store a JSON file in ADLS that details each notebook/pipeline and its file name dependencies.
    2. I'd then provision an Azure SQL table to store the metadata of each of these files (see the sketch after this list). Essentially, this table will have the following columns:
    • General file name, which matches the file name dependencies in step #1 (e.g.: FileName)
    • Real file name (e.g.: FileName_20201007.csv)
    • Timestamp
    • Flag (boolean) indicating whether the file is present
    • Flag (boolean) indicating whether the file has been processed (i.e.: the Databricks notebook that depends on it has run)
    3. To populate the table in step #2, I'd use an Azure Logic App that watches for when a blob meeting your criteria is created and then creates or updates the corresponding entry in the Azure SQL table. See: https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-azureblobstorage & https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-sqlazure

    4. You'll need to ensure that at the end of the Azure pipeline/Databricks notebook that is run, you update the Azure SQL flags of the respective dependencies to indicate that these versions of the files have been processed (see the second sketch below). Your Azure SQL table will function as a 'watermark' table.
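
    As a rough sketch of steps #1 and #2, the dependency file and watermark table could look something like the following. The file name notebook_dependencies.json, the table name dbo.FileWatermark, the column names, and the connection details are all illustrative placeholders, not anything prescribed above:

    ```python
    import json
    import pyodbc  # assumes the Microsoft ODBC driver for SQL Server is installed

    # Step #1 (sketch): dependency file keyed by notebook/pipeline.
    # The notebook, pipeline, and file names are hypothetical examples;
    # in practice you would upload this JSON to ADLS rather than write it locally.
    dependencies = {
        "notebooks": [
            {
                "notebook": "/Shared/process_sales",
                "pipeline": "pl_process_sales",
                "file_dependencies": ["SalesFile", "CustomerFile"],
            },
            {
                "notebook": "/Shared/process_inventory",
                "pipeline": "pl_process_inventory",
                "file_dependencies": ["InventoryFile"],
            },
        ]
    }

    with open("notebook_dependencies.json", "w") as f:
        json.dump(dependencies, f, indent=2)

    # Step #2 (sketch): watermark table in Azure SQL, one row per file arrival.
    CREATE_WATERMARK_TABLE = """
    CREATE TABLE dbo.FileWatermark (
        GeneralFileName NVARCHAR(200) NOT NULL,  -- matches the names in the JSON, e.g. 'SalesFile'
        RealFileName    NVARCHAR(400) NOT NULL,  -- e.g. 'SalesFile_20201007.csv'
        LoadedTimestamp DATETIME2     NOT NULL,
        IsPresent       BIT           NOT NULL DEFAULT 1,
        IsProcessed     BIT           NOT NULL DEFAULT 0
    );
    """

    conn = pyodbc.connect(
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=<your-server>.database.windows.net;Database=<your-db>;"
        "UID=<user>;PWD=<password>"
    )
    conn.execute(CREATE_WATERMARK_TABLE)
    conn.commit()
    ```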

    Before your pipeline triggers the Azure Databricks notebook, it will look up the JSON file in ADLS, identify the dependencies for each notebook, check that all the dependencies are available AND not yet processed by the Databricks notebook, and only then run the Databricks notebook once all these criteria are met.
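
    A minimal sketch of that gate check, together with the flag update from step #4, could look like the following. It assumes the hypothetical notebook_dependencies.json file and dbo.FileWatermark table from the sketch above; in ADF the same logic can be expressed with Lookup and If Condition activities instead:

    ```python
    import json
    import pyodbc

    def unprocessed_files(conn, general_names):
        """Return the subset of general_names that has an unprocessed file present."""
        placeholders = ",".join("?" * len(general_names))
        rows = conn.execute(
            "SELECT DISTINCT GeneralFileName FROM dbo.FileWatermark "
            f"WHERE GeneralFileName IN ({placeholders}) "
            "AND IsPresent = 1 AND IsProcessed = 0",
            general_names,
        ).fetchall()
        return {row.GeneralFileName for row in rows}

    def ready_to_run(conn, notebook_entry):
        """True only when every dependency has an unprocessed file waiting."""
        needed = set(notebook_entry["file_dependencies"])
        return needed == unprocessed_files(conn, list(needed))

    def mark_processed(conn, general_names):
        """Step #4: flag the consumed file versions once the notebook has finished."""
        placeholders = ",".join("?" * len(general_names))
        conn.execute(
            "UPDATE dbo.FileWatermark SET IsProcessed = 1 "
            f"WHERE GeneralFileName IN ({placeholders}) AND IsProcessed = 0",
            general_names,
        )
        conn.commit()

    # Gate: run each notebook only when all of its dependencies are ready.
    with open("notebook_dependencies.json") as f:
        config = json.load(f)

    conn = pyodbc.connect("<your-connection-string>")
    for entry in config["notebooks"]:
        if ready_to_run(conn, entry):
            # Trigger the Databricks notebook here (e.g. via the Databricks Jobs API
            # or an ADF Execute Pipeline activity), then record the watermark:
            mark_processed(conn, entry["file_dependencies"])
    ```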

    In terms of triggering your pipeline, you could either use an Azure Logic App to do this or leverage a tumbling window trigger in ADF.