azure-data-factory

Azure Data Factory - Controlling order of transformations within a dataflow


I'm trying to determine if there is some way to control the order of transformations WITHIN a dataflow in Azure Data Factory. I know this can be done with multiple data flows in a pipeline, but can you control the order of transformations within a single data flow to ensure that one transformation does not run until another transformation has completed?

I have a very large data flow, and I've noticed that as it has grown, the output can change between runs even though I've made no changes to the source data or to the transformations in the data flow itself. I suspect this is because the multiple input streams run independently, and a transformation in one stream that I reference in a second stream may or may not have completed before the second stream reads from it.

Here is a very simple example. I need some way to ensure that the 'AggregateTheData' transformation is completed before the 'SampleJoin' transformation is processed because the 'SampleJoin' transformation uses 'AggregateTheData' as one of its join inputs.

Example ADF dataflow
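
For reference, a minimal data flow script sketch of roughly that shape might look like the following; the sources, columns, and sink here are illustrative assumptions, and only the AggregateTheData and SampleJoin names come from the actual flow:

    source(output(
            id as integer,
            amount as double
        ),
        allowSchemaDrift: true) ~> RawData
    source(output(
            id as integer,
            name as string
        ),
        allowSchemaDrift: true) ~> LookupData
    RawData aggregate(groupBy(id),
        totalAmount = sum(amount)) ~> AggregateTheData
    LookupData, AggregateTheData join(LookupData@id == AggregateTheData@id,
        joinType: 'inner',
        broadcast: 'auto') ~> SampleJoin
    SampleJoin sink(allowSchemaDrift: true,
        validateSchema: false) ~> sink1

Here SampleJoin consumes AggregateTheData as its right-hand join input, which is the dependency in question.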


Solution

  • When you use the output of one transformation as an input to another, the data flow waits for that upstream transformation to finish before running the downstream one, so the dependency order within a single data flow is respected automatically.

    I tried the same scenario as above.


    sink1 writes the lookup2.csv file and sink2 writes the display2.csv file.

    The display2.csv file contained the expected joined results.


    Also, the creation times of the two files show that the join transformation ran only after the aggregate transformation had completed.


    You can also set a custom sink ordering to ensure that sink1 is written first; a script sketch of this is shown below.

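    If you look at the underlying data flow script, that write order shows up as a property on each sink. A rough sketch of the two sinks from the test above, assuming the ordering is emitted as a saveOrder property and that sink1 hangs off the aggregate while sink2 hangs off the join (worth confirming against the script your own flow generates):

        AggregateTheData sink(allowSchemaDrift: true,
            validateSchema: false,
            saveOrder: 1) ~> sink1
        SampleJoin sink(allowSchemaDrift: true,
            validateSchema: false,
            saveOrder: 2) ~> sink2

    With this in place, sink1 (lookup2.csv) is written before sink2 (display2.csv) regardless of how the rest of the graph is scheduled.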