Tags: azure, azure-databricks, azure-data-factory

Data Factory - Foreach activity: run in parallel but sequentially


I'm creating an ADF pipeline and I'm using a ForEach activity to run multiple Databricks notebooks.

My problem is that two of the notebooks have a dependency between them.

That is, one notebook has to run before the other because of that dependency. I know that the ForEach activity can be executed either sequentially or in batches. But the problem is that a sequential run executes the items one by one, and since I have several partitions, it takes a long time.

What I want is to run sequentially but in batches. In other words, I have a notebook that runs with the ES, UK and DK partitions, and I want those partitions of that notebook to run in parallel, wait for the notebook's complete execution, and only then start running the next notebook with the same partitions. If I set the ForEach to batch mode, it doesn't wait for the full execution; it starts running the other notebook at arbitrary points.

I get the order of the notebooks from a config table, in which I specify the order they should run in; a notebook then builds my final JSON with that order (a sketch of such a JSON follows the table below).

Config Table:

sPath   TableSource   TableDest    order
path1   dbo.table1    dbo.table1   1
path2   dbo.table2    dbo.table2   2
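
For illustration, the final JSON built from this config table might look something like the following. The field names come from the table above; the exact shape the notebook produces is an assumption:

    [
      { "sPath": "path1", "TableSource": "dbo.table1", "TableDest": "dbo.table1", "order": 1 },
      { "sPath": "path2", "TableSource": "dbo.table2", "TableDest": "dbo.table2", "order": 2 }
    ]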

This is my pipeline: [pipeline screenshot]

The execution I want is batched and sequential at the same time, but it is not possible to select Sequential and set a Batch count at the same time.

Can anyone please help me in achieving this?

Thank you!


Solution

  • I have tried to reproduce this. To run a ForEach both sequentially and batchwise, we need two pipelines, one nested inside the other. The outer pipeline handles the sequential run and the inner pipeline handles the batch run. Below are the steps.

    • Took a sample config table as in the image below.

    [config_table screenshot]

    • Pipeline 1 is the outer pipeline. Its Lookup activity selects only the distinct sortorder values in increasing order; each sortorder value is then passed as a parameter to the child pipeline sequentially. (A sample of the Lookup output is sketched after the query below.)

    select distinct sortorder from config_table order by sortorder

    [Lookup activity screenshot]
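
    With the config table above, the output of this Lookup would look roughly like this (a sketch, assuming two distinct sortorder values and the Lookup's "First row only" option unchecked):

    {
      "count": 2,
      "value": [
        { "sortorder": 1 },
        { "sortorder": 2 }
      ]
    }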

    • A ForEach activity is added after the Lookup activity. We use this one for the sequential run, so Sequential is checked and the output of the Lookup activity is given in the Items box (a JSON sketch of this activity follows).

    @activity('Lookup1').output.value

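    In pipeline JSON terms, the outer ForEach would look roughly like the following minimal sketch; the activity names are assumptions, and the Execute Pipeline activity from the next step goes inside the activities array. "isSequential": true corresponds to the Sequential checkbox:

    {
      "name": "ForEach1",
      "type": "ForEach",
      "dependsOn": [
        { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] }
      ],
      "typeProperties": {
        "isSequential": true,
        "items": { "value": "@activity('Lookup1').output.value", "type": "Expression" },
        "activities": [ ]
      }
    }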

    • Inside the ForEach activity, pipeline2 is invoked with an Execute Pipeline activity, and a pipeline parameter pp_sortorder is added in the child pipeline pipeline2 (a sketch of this activity follows).

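    The Execute Pipeline activity might be configured as below. "waitOnCompletion": true is what makes the outer loop wait for pipeline2 to finish before moving on to the next sortorder; @item().sortorder assumes the Lookup column is named sortorder:

    {
      "name": "Execute Pipeline1",
      "type": "ExecutePipeline",
      "typeProperties": {
        "pipeline": { "referenceName": "pipeline2", "type": "PipelineReference" },
        "waitOnCompletion": true,
        "parameters": {
          "pp_sortorder": { "value": "@item().sortorder", "type": "Expression" }
        }
      }
    }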

    • In pipeline2, a Lookup activity is added, with its dataset referring to the config table and the query filtered by the sortorder value received from pipeline1 (a JSON sketch follows the query):

    select * from config_table where sortorder= @{pipeline().parameters.pp_sortorder}

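    As pipeline JSON, that Lookup could look roughly like the following. "firstRowOnly" must be false so that all rows for the given sortorder reach the inner ForEach; the dataset name and the source type (here an Azure SQL source) are placeholders:

    {
      "name": "Lookup1",
      "type": "Lookup",
      "typeProperties": {
        "source": {
          "type": "AzureSqlSource",
          "sqlReaderQuery": {
            "value": "select * from config_table where sortorder= @{pipeline().parameters.pp_sortorder}",
            "type": "Expression"
          }
        },
        "dataset": { "referenceName": "config_table_ds", "type": "DatasetReference" },
        "firstRowOnly": false
      }
    }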

    • Next to the Lookup, a ForEach is added; the output of the Lookup activity is given in Items and a Batch count of 5 is set here (the batch count can be increased as per requirement; a JSON sketch follows). [ForEach settings screenshot]
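
    The inner ForEach mirrors the outer one, except that "isSequential" is false and "batchCount" controls the degree of parallelism (ADF caps the batch count at 50). The Stored Procedure activity from the next step goes inside the activities array:

    {
      "name": "ForEach2",
      "type": "ForEach",
      "typeProperties": {
        "isSequential": false,
        "batchCount": 5,
        "items": { "value": "@activity('Lookup1').output.value", "type": "Expression" },
        "activities": [ ]
      }
    }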

    • A Stored Procedure activity is added inside the ForEach activity to verify the parallel processing (a sketch follows). [Stored Procedure activity screenshot]
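
    A minimal sketch of that Stored Procedure activity; the procedure and linked service names are placeholders, and in the original scenario this slot would hold the Databricks Notebook activity instead:

    {
      "name": "Stored procedure1",
      "type": "SqlServerStoredProcedure",
      "linkedServiceName": { "referenceName": "AzureSqlDatabase1", "type": "LinkedServiceReference" },
      "typeProperties": { "storedProcedureName": "[dbo].[usp_test]" }
    }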

    After setting up all of this, pipeline1 is executed. The Execute Pipeline activity of pipeline1 runs sequentially, while the Stored Procedure activities of pipeline2 run simultaneously.

    pipeline1 output status: the second Execute Pipeline run starts only once the first one has ended. [pipeline1 run output screenshot]

    pipeline2 output status: all the Stored Procedure activities started executing simultaneously. [pipeline2 run output screenshot]