I'm creating an ADF pipeline and I'm using a ForEach activity to run multiple Databricks notebooks.
My problem is that two notebooks depend on each other.
That is, one notebook has to finish before the other can start, because of the dependency. I know the ForEach activity can run either sequentially or in batches, but running sequentially executes the items one by one, and since I have several partitions that takes a long time.
What I want is to run sequentially *and* in batches. In other words, I have a notebook that runs against the ES, UK, and DK partitions; I want those partitions to run in parallel, wait for the notebook to finish completely, and only then start running the other notebook against the same partitions. If I set a batch count, ADF doesn't wait for full execution; it starts running the other notebook in an arbitrary order.
I get the order of the notebooks from a config table, in which I specify the order they should run in, and a notebook then builds my final JSON with that order.
Config Table:

| sPath | TableSource | TableDest | order |
|---|---|---|---|
| path1 | dbo.table1 | dbo.table1 | 1 |
| path2 | dbo.table2 | dbo.table2 | 2 |
I want the execution to be both batched and sequential, but it is not possible to select Sequential and a Batch count at the same time.
Can anyone please help me achieve this?
Thank you!
I have tried to reproduce this. To run a ForEach both sequentially and in batches, we need two pipelines, one nested inside the other: the outer pipeline provides the sequential execution and the inner pipeline runs in batches. Below are the steps.

In **pipeline1** (outer), a Lookup activity retrieves the distinct sort orders from the config table:

```sql
select distinct sortorder from config_table order by sortorder
```

A ForEach activity set to Sequential iterates over the Lookup output:

```
@activity('Lookup1').output.value
```

Inside this ForEach, an Execute Pipeline activity calls **pipeline2**, passing the current sort order as the parameter `pp_sortorder`.

In **pipeline2** (inner), a Lookup activity retrieves only the rows for the sort order that was passed in:

```sql
select * from config_table where sortorder= @{pipeline().parameters.pp_sortorder}
```
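For reference, the outer ForEach can be sketched in ADF pipeline JSON roughly as follows. This is a minimal sketch: the activity and pipeline names are assumptions, and only `pp_sortorder` and the Lookup expression come from the setup above. `isSequential: true` runs one iteration at a time, and `waitOnCompletion: true` makes the ForEach wait for each pipeline2 run to finish before starting the next sort order.

```json
{
  "name": "ForEachSortOrder",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": true,
    "items": {
      "value": "@activity('Lookup1').output.value",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "ExecutePipeline2",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": {
            "referenceName": "pipeline2",
            "type": "PipelineReference"
          },
          "waitOnCompletion": true,
          "parameters": {
            "pp_sortorder": "@item().sortorder"
          }
        }
      }
    ]
  }
}
```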
Next to the Lookup, a ForEach activity is added; its Items field is set to the Lookup output, and a Batch count of 5 is set (the batch count can be increased as required).
A Stored Procedure activity is added inside the ForEach to verify the parallel processing.
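The inner ForEach in pipeline2 can be sketched the same way. Again a minimal sketch under assumptions: the activity names and the stored procedure name are hypothetical, the linked service reference is omitted, and only the batch count of 5 matches the setup above. With `isSequential: false` and a `batchCount`, the iterations run in parallel.

```json
{
  "name": "ForEachBatch",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 5,
    "items": {
      "value": "@activity('Lookup2').output.value",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "RunStoredProcedure",
        "type": "SqlServerStoredProcedure",
        "typeProperties": {
          "storedProcedureName": "[dbo].[usp_test]"
        }
      }
    ]
  }
}
```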
After setting up all of this, **pipeline1** is executed. The Execute Pipeline activity in pipeline1 runs sequentially, while the Stored Procedure activities in pipeline2 run simultaneously.
**pipeline1** output status:
The second Execute Pipeline run starts only once the first one has ended.
**pipeline2** output status:
All Stored Procedure activities started executing simultaneously.