I want to transform a list of tables in parallel using Azure Data Factory and one single Databricks Notebook.
I already have an Azure Data Factory (ADF) pipeline that receives a list of tables as a parameter, sets each table from the list as a variable, and then calls a single notebook (which performs simple transformations), passing it one table at a time. The problem is that the tables are transformed in series (one after the other) rather than in parallel (all at the same time). I need the tables to be processed in parallel.
So, my questions are: 1) Is it possible to trigger the same Databricks notebook multiple times at the same time (each run with a different table as a parameter) from Azure Data Factory? 2) If yes, what do I need to change in my pipeline or notebook to make it work?
Thanks in advance :)
Parameters
Variables
Set Table Variables and Notebook
Configure Sequential
Sequential Unchecked with Batch Count = blank
With "Sequential" unchecked and Batch Count left blank, when I pass two tables the pipeline runs "successfully", but only one table is transformed (even if I add more tables to the table list). "Set variable" correctly shows up twice, once for each table, but "Orchestrate" shows up twice for the same table.
Sequential Unchecked with Batch Count = 2
With "Sequential" unchecked and Batch Count = 2, when I pass two tables the pipeline fails on the second iteration, and it also tries to transform the same table twice. "Set variable" correctly shows up twice, once for each table, but "Orchestrate" shows up twice for the same table.
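The symptom above (each "Set variable" run shows a different table, yet "Orchestrate" receives the same one twice) is consistent with ADF variables being scoped to the whole pipeline run rather than to a single ForEach iteration, so parallel iterations can overwrite each other's value. The sketch below is a plain-Python simulation of that behaviour, not ADF code: `run_foreach` is a hypothetical helper, and the "all writes happen before any read" interleaving is an assumption chosen to make the effect deterministic.

```python
# Simulation only: plain Python standing in for an ADF ForEach, not ADF code.

def run_foreach(tables, batch_count):
    """Run 'Set variable' then 'notebook' per item, batch_count items at a time."""
    # One variable shared by the whole pipeline run (not one per iteration).
    pipeline_variable = {"table": None}
    processed = []
    for start in range(0, len(tables), batch_count):
        batch = tables[start:start + batch_count]
        # Assumed interleaving: every iteration in the batch sets the variable...
        for item in batch:
            pipeline_variable["table"] = item
        # ...before any of them reads it back for the notebook call.
        for _ in batch:
            processed.append(pipeline_variable["table"])
    return processed


sequential = run_foreach(["table_a", "table_b"], batch_count=1)
parallel = run_foreach(["table_a", "table_b"], batch_count=2)

print(sequential)  # ['table_a', 'table_b']
print(parallel)    # ['table_b', 'table_b']
```

With a batch of one, each table is read back before the next one is written, which matches the sequential run working correctly. With a batch of two, the last write wins and the same table is handed to the notebook twice.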
Sequential Checked or Batch Count = 1
If I leave Sequential checked or set Batch Count = 1, the pipeline runs correctly and transforms all tables, but the processing happens in series (as expected). Example below for 5 tables.
Set Variable Task
Variable table passed with value @item()
Variable "table" defined as string
Parameter "table_list"
Pipeline Run Parameters
I solved it by using a "Lookup" activity against a SQL table instead of "Set Variable". The picture below shows a run of 5 tables in parallel using one single notebook.
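As a rough sketch of what that setup might look like in the pipeline definition, the Python below builds a ForEach fragment that iterates over the Lookup output and passes the current item straight to the notebook, with no "Set Variable" in between. This is not the actual pipeline from the screenshots: the Lookup activity name `LookupTables`, the column `table_name`, the linked service `AzureDatabricks_LS`, the notebook path, and the helper `build_foreach` are all assumptions for illustration, and the Lookup activity itself and the `dependsOn` wiring are omitted.

```python
import json


def build_foreach(items_expression, table_expression, notebook_path, batch_count):
    """Build a ForEach activity fragment that calls one notebook per item in parallel."""
    notebook_activity = {
        "name": "Orchestrate",
        "type": "DatabricksNotebook",
        "linkedServiceName": {
            "referenceName": "AzureDatabricks_LS",  # hypothetical linked service name
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "notebookPath": notebook_path,
            # The current item is passed directly; no pipeline variable is involved.
            "baseParameters": {
                "table": {"value": table_expression, "type": "Expression"}
            },
        },
    }
    return {
        "name": "ForEachTable",
        "type": "ForEach",
        "typeProperties": {
            "isSequential": False,  # "Sequential" unchecked
            "batchCount": batch_count,
            "items": {"value": items_expression, "type": "Expression"},
            "activities": [notebook_activity],
        },
    }


foreach = build_foreach(
    items_expression="@activity('LookupTables').output.value",  # hypothetical Lookup name
    table_expression="@item().table_name",  # hypothetical column returned by the Lookup
    notebook_path="/Shared/transform_table",  # hypothetical notebook path
    batch_count=5,
)
print(json.dumps(foreach, indent=2))
```

On the notebook side, the `table` value would typically be read with `dbutils.widgets.get("table")`. The same idea should also apply when iterating over the original `table_list` parameter: reference `@item()` directly in the notebook's base parameters rather than going through a shared variable.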