Tags: azure, azure-databricks, azure-data-factory

Data Factory - Foreach activity: run in parallel but sequentially


I'm creating an ADF pipeline and I'm using a ForEach activity to run multiple Databricks notebooks.

My problem is that two of the notebooks have a dependency between them.

That is, one notebook has to run before the other because of that dependency. I know that the ForEach activity can be executed either sequentially or in batches. But the problem is that a sequential run executes the items one by one, and since I have several partitions, it takes a long time.

What I want is to run sequentially but in batches. In other words, I have a notebook that runs with the ES, UK and DK partitions, and I want those partitions of that notebook to run in parallel, wait for the notebook's complete execution, and only then start running the next notebook with the same partitions. If I set the ForEach to batch mode, it doesn't wait for the full execution; it starts running the other notebook at arbitrary points.

I get the order of the notebooks from a config table, in which I specify the order they should run in; a notebook then builds my final JSON with that order (a sketch of such a JSON follows the table below).

Config Table:

sPath   TableSource   TableDest    order
path1   dbo.table1    dbo.table1   1
path2   dbo.table2    dbo.table2   2
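
For illustration, the final JSON built from this config table might look something like the following. The field names come from the table above; the exact shape the notebook produces is an assumption:

    [
      { "sPath": "path1", "TableSource": "dbo.table1", "TableDest": "dbo.table1", "order": 1 },
      { "sPath": "path2", "TableSource": "dbo.table2", "TableDest": "dbo.table2", "order": 2 }
    ]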

This is my pipeline: [pipeline screenshot]

The execution I want is batched and sequential at the same time, but it is not possible to select Sequential and set a Batch count at the same time.

Can anyone please help me in achieving this?

Thank you!


Solution

  • I have tried to reproduce this. To run a ForEach both sequentially and batchwise, we need two pipelines, one nested inside the other. The outer pipeline handles the sequential run and the inner pipeline handles the batch run. Below are the steps.

    • Took a sample config table as in the image below.

    [config_table screenshot]

    • Pipeline 1 is the outer pipeline. Its Lookup activity selects only the distinct sortorder values in increasing order; each sortorder value is then passed as a parameter to the child pipeline sequentially. (A sample of the Lookup output is sketched after the query below.)

    select distinct sortorder from config_table order by sortorder

    [Lookup activity screenshot]
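
    With the config table above, the output of this Lookup would look roughly like this (a sketch, assuming two distinct sortorder values and the Lookup's "First row only" option unchecked):

    {
      "count": 2,
      "value": [
        { "sortorder": 1 },
        { "sortorder": 2 }
      ]
    }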

    • A ForEach activity is added after the Lookup activity. We use this one for the sequential run, so Sequential is checked and the output of the Lookup activity is given in the Items box (a JSON sketch of this activity follows).

    @activity('Lookup1').output.value

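    In pipeline JSON terms, the outer ForEach would look roughly like the following minimal sketch; the activity names are assumptions, and the Execute Pipeline activity from the next step goes inside the activities array. "isSequential": true corresponds to the Sequential checkbox:

    {
      "name": "ForEach1",
      "type": "ForEach",
      "dependsOn": [
        { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] }
      ],
      "typeProperties": {
        "isSequential": true,
        "items": { "value": "@activity('Lookup1').output.value", "type": "Expression" },
        "activities": [ ]
      }
    }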

    • Inside the ForEach activity, pipeline2 is invoked with an Execute Pipeline activity, and a pipeline parameter pp_sortorder is added in the child pipeline pipeline2 (a sketch of this activity follows).

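    The Execute Pipeline activity might be configured as below. "waitOnCompletion": true is what makes the outer loop wait for pipeline2 to finish before moving on to the next sortorder; @item().sortorder assumes the Lookup column is named sortorder:

    {
      "name": "Execute Pipeline1",
      "type": "ExecutePipeline",
      "typeProperties": {
        "pipeline": { "referenceName": "pipeline2", "type": "PipelineReference" },
        "waitOnCompletion": true,
        "parameters": {
          "pp_sortorder": { "value": "@item().sortorder", "type": "Expression" }
        }
      }
    }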

    • In pipeline2, a Lookup activity is added, with its dataset referring to the config table and the query filtered by the sortorder value received from pipeline1 (a JSON sketch follows the query):

    select * from config_table where sortorder= @{pipeline().parameters.pp_sortorder}

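    As pipeline JSON, that Lookup could look roughly like the following. "firstRowOnly" must be false so that all rows for the given sortorder reach the inner ForEach; the dataset name and the source type (here an Azure SQL source) are placeholders:

    {
      "name": "Lookup1",
      "type": "Lookup",
      "typeProperties": {
        "source": {
          "type": "AzureSqlSource",
          "sqlReaderQuery": {
            "value": "select * from config_table where sortorder= @{pipeline().parameters.pp_sortorder}",
            "type": "Expression"
          }
        },
        "dataset": { "referenceName": "config_table_ds", "type": "DatasetReference" },
        "firstRowOnly": false
      }
    }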

    • Next to the Lookup, a ForEach is added; the output of the Lookup activity is given in Items and a Batch count of 5 is set here (the batch count can be increased as per requirement; a JSON sketch follows). [ForEach settings screenshot]
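
    The inner ForEach mirrors the outer one, except that "isSequential" is false and "batchCount" controls the degree of parallelism (ADF caps the batch count at 50). The Stored Procedure activity from the next step goes inside the activities array:

    {
      "name": "ForEach2",
      "type": "ForEach",
      "typeProperties": {
        "isSequential": false,
        "batchCount": 5,
        "items": { "value": "@activity('Lookup1').output.value", "type": "Expression" },
        "activities": [ ]
      }
    }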

    • A Stored Procedure activity is added inside the ForEach activity to verify the parallel processing (a sketch follows). [Stored Procedure activity screenshot]
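
    A minimal sketch of that Stored Procedure activity; the procedure and linked service names are placeholders, and in the original scenario this slot would hold the Databricks Notebook activity instead:

    {
      "name": "Stored procedure1",
      "type": "SqlServerStoredProcedure",
      "linkedServiceName": { "referenceName": "AzureSqlDatabase1", "type": "LinkedServiceReference" },
      "typeProperties": { "storedProcedureName": "[dbo].[usp_test]" }
    }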

    After setting up all of this, pipeline1 is executed. The Execute Pipeline activity of pipeline1 runs sequentially, while the Stored Procedure activities of pipeline2 run simultaneously.

    pipeline1 output status: the second Execute Pipeline run starts only once the first one has ended. [pipeline1 run output screenshot]

    pipeline2 output status: all the Stored Procedure activities started executing simultaneously. [pipeline2 run output screenshot]