Tags: azure, azure-data-factory, azure-databricks, spark-notebook

Processing tables in parallel using Azure Data Factory, single pipeline, single Databricks Notebook?


I want to transform a list of tables in parallel using Azure Data Factory and one single Databricks Notebook.

I already have an Azure Data Factory (ADF) pipeline that receives a list of tables as a parameter, sets each table from the list as a variable inside a ForEach activity, and then calls one single notebook (that performs simple transformations), passing one table at a time. The problem is that the tables are transformed in series (one after the other) rather than in parallel (all tables at the same time). I need the tables to be processed in parallel.
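Roughly, the ForEach portion of my pipeline looks like the sketch below (the notebook path and linked service name are placeholders for my actual setup; "Set variable" and "Orchestrate" are the activity names from the screenshots further down). Unchecking Sequential in the UI corresponds to "isSequential": false in this JSON, and the Batch Count box maps to "batchCount":

{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.table_list",
      "type": "Expression"
    },
    "isSequential": true,
    "activities": [
      {
        "name": "Set variable",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "table",
          "value": { "value": "@item()", "type": "Expression" }
        }
      },
      {
        "name": "Orchestrate",
        "type": "DatabricksNotebook",
        "dependsOn": [
          { "activity": "Set variable", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "notebookPath": "/Shared/transform_table",
          "baseParameters": {
            "table": { "value": "@variables('table')", "type": "Expression" }
          }
        },
        "linkedServiceName": {
          "referenceName": "AzureDatabricksLinkedService",
          "type": "LinkedServiceReference"
        }
      }
    ]
  }
}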

So, my questions are: 1) Is it possible for Azure Data Factory to trigger the same Databricks notebook multiple times at the exact same point in time (each time with a different table as a parameter)? 2) If yes, what do I need to change in my pipeline or notebook to make it work?

Thanks in advance :)

[Screenshot: ADF Parameters]

[Screenshot: Variables]

[Screenshot: Set Table Variables and Notebook]

[Screenshot: Configure Sequential]

Sequential Unchecked with Batch Count = blank

With Sequential unchecked and Batch Count left blank, passing two tables, the pipeline runs "successfully" but only one table is transformed (even if I add more tables to the table list). "Set variable" correctly runs twice, once for each table, but Orchestrate runs twice for the same table.


Sequential Unchecked with Batch Count = 2

With Sequential unchecked and Batch Count = 2, passing two tables, the pipeline fails on the second iteration, and it also tries to transform the same table twice. "Set variable" correctly runs twice, once for each table, but Orchestrate runs twice for the same table.

[Screenshot: Sequential unchecked with Batch Count = 2]

Sequential Checked or Batch Count = 1

If I leave Sequential checked or set Batch Count = 1, the pipeline runs correctly and transforms all tables, but the processing occurs in series (as expected). Example below for 5 tables.

[Screenshot: example run with Sequential checked or Batch Count = 1]

[Screenshot: Set Variable task overview]

[Screenshot: variable "table" passed with value @item()]

[Screenshot: variable "table" defined as a string]

[Screenshot: parameter "table_list"]

[Screenshot: pipeline run parameters]


Solution

  • I solved it by using a "Lookup" against a SQL table instead of "Set Variable". Pipeline variables are scoped to the entire pipeline run, so parallel ForEach iterations overwrite the same variable before the notebook reads it; that is why Orchestrate kept running twice for the same table. Reading each item from the Lookup output instead avoids that shared state. The picture below shows a run of 5 tables in parallel using one single notebook; a sketch of the pipeline follows.

    [Screenshot: run of 5 tables in parallel]
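In outline, the fix looks like the sketch below: a Lookup reads the table list from SQL, the ForEach iterates over the Lookup output with Sequential unchecked, and each iteration passes @item() straight into the notebook's base parameters, so no shared variable is involved. The names "LookupTables", "TableListDataset", "AzureDatabricksLinkedService", the notebook path, and the "table_name" column are placeholders for my actual setup:

{
  "activities": [
    {
      "name": "LookupTables",
      "type": "Lookup",
      "typeProperties": {
        "source": { "type": "AzureSqlSource" },
        "dataset": {
          "referenceName": "TableListDataset",
          "type": "DatasetReference"
        },
        "firstRowOnly": false
      }
    },
    {
      "name": "ForEachTable",
      "type": "ForEach",
      "dependsOn": [
        { "activity": "LookupTables", "dependencyConditions": [ "Succeeded" ] }
      ],
      "typeProperties": {
        "items": {
          "value": "@activity('LookupTables').output.value",
          "type": "Expression"
        },
        "isSequential": false,
        "batchCount": 5,
        "activities": [
          {
            "name": "Orchestrate",
            "type": "DatabricksNotebook",
            "typeProperties": {
              "notebookPath": "/Shared/transform_table",
              "baseParameters": {
                "table": { "value": "@item().table_name", "type": "Expression" }
              }
            },
            "linkedServiceName": {
              "referenceName": "AzureDatabricksLinkedService",
              "type": "LinkedServiceReference"
            }
          }
        ]
      }
    }
  ]
}

With firstRowOnly set to false, the Lookup returns an array under output.value, which is exactly the shape ForEach expects, and batchCount caps how many notebook runs ADF starts at once.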