I'm looking for a lightweight way of executing a Databricks notebook that depends on multiple files having been loaded to Azure Data Lake Storage.
Several different ADF packages load different files into ADLS, which are then processed by Databricks notebooks. Some of the notebooks depend on multiple files from different packages.
A single file is simple enough with an event trigger. Can this be generalised to more than one file without something like Airflow handling dependencies?
This isn't exactly lightweight since you'll have to provision an Azure SQL table, but this is what I'd do:

1. Store a JSON file in ADLS that lists the file dependencies of each Databricks notebook.
2. Provision an Azure SQL table that tracks, for each dependency file, whether the latest version has arrived and whether it has been processed.
3. Before each notebook runs, have the pipeline check the table against the JSON file and only proceed when every dependency is satisfied.
To populate the table in Step#2, I'd use an Azure Logic App that fires when a blob meeting your criteria is created and then updates or inserts the corresponding row in the Azure SQL table. See: https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-azureblobstorage & https://learn.microsoft.com/en-us/azure/connectors/connectors-create-api-sqlazure
You'll need to ensure that at the end of each pipeline/Databricks notebook run, you update the Azure SQL flags of the respective dependencies to indicate that those versions of the files have been processed. Your Azure SQL table will function as a 'watermark' table.
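A minimal shape for that watermark table might look like this (table and column names are just placeholders, not from the original post):

```sql
-- One row per dependency file; the pipeline flips is_processed after the
-- notebook has consumed the current version of the file.
CREATE TABLE dbo.FileWatermark (
    file_name      NVARCHAR(400) NOT NULL,  -- blob path or file name in ADLS
    last_modified  DATETIME2     NOT NULL,  -- timestamp of the latest arrival
    is_processed   BIT           NOT NULL DEFAULT 0,
    CONSTRAINT PK_FileWatermark PRIMARY KEY (file_name)
);
```

The Logic App would upsert a row (and reset `is_processed` to 0) whenever a new blob arrives; the notebook's pipeline sets `is_processed` to 1 when it finishes.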
Before your pipeline triggers the Azure Databricks notebook, it will look up the JSON file in ADLS, identify the dependencies for each notebook, check that all the dependencies are available AND not yet processed by the Databricks notebook, and only run the Databricks notebook once all these criteria are met.
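A sketch of what that JSON dependency file could contain (the structure and names here are assumptions, since the post doesn't show one):

```json
{
  "notebooks": [
    {
      "name": "sales_summary",
      "dependencies": ["raw/sales_eu.csv", "raw/sales_us.csv"]
    },
    {
      "name": "inventory_report",
      "dependencies": ["raw/inventory.csv"]
    }
  ]
}
```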
In terms of triggering your pipeline, you could either use an Azure Logic App or a tumbling window trigger in ADF.
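The dependency check before the notebook runs can be sketched in plain Python (the function and the watermark representation are illustrative, not a real ADF or Databricks API; in practice the statuses would come from a query against the Azure SQL table):

```python
# Sketch of the pre-run gate: the notebook runs only when every required file
# has arrived AND its current version has not yet been processed.

def all_dependencies_ready(required_files, watermark):
    """Return True only when every required file is present in the watermark
    and none of the current versions has already been consumed."""
    for f in required_files:
        status = watermark.get(f)          # None means the file never arrived
        if status is None or status["processed"]:
            return False
    return True

# Example: one notebook needs two files loaded by different ADF packages.
required = ["raw/sales_eu.csv", "raw/sales_us.csv"]

watermark = {
    "raw/sales_eu.csv": {"processed": False},  # arrived, not yet consumed
    "raw/sales_us.csv": {"processed": False},
}
print(all_dependencies_ready(required, watermark))  # True -> run the notebook

watermark["raw/sales_us.csv"]["processed"] = True   # this version already consumed
print(all_dependencies_ready(required, watermark))  # False -> wait for a new file
```

After a successful notebook run, the pipeline would flip the `processed` flags back, which is the watermark update described above.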