azure-data-factory

Azure Data Factory - iterating files in pipeline vs data flows performance


This question came up because I have two different approaches to my ingestions:

  1. ADF Pipeline iterates through files
  2. ADF Data Flows runs through files based on wildcard filename/location

What I need is to be able to audit each file ingestion by recording the filename, source row count, target row count and status. I can do this fairly easily in pipelines, but I'm new to data flows, and I'm not sure how, after all the branching, derived columns and so on, you add the audit record "per" file.

My initial thought is to change my data flow so that it only handles one file at a time (using parameters), then change my pipeline so that it iterates over the list of files, calling the data flow for each file. This lets me do all the auditing in the pipeline.
I'm not sure whether this is optimal in terms of performance, though.
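
For concreteness, the per-file audit logic I have in mind looks roughly like this (a minimal sketch outside ADF; count_rows, ingest and the hard-coded file list are just placeholders for the real copy step and file listing):

```python
import csv
from datetime import datetime, timezone

def count_rows(path: str) -> int:
    """Count data rows in a delimited file, excluding the header."""
    with open(path, newline="") as f:
        return max(sum(1 for _ in csv.reader(f)) - 1, 0)

def ingest(path: str) -> int:
    """Placeholder for the real copy/transform step; returns rows written."""
    return count_rows(path)  # assume a straight copy for the sketch

def audit_file(path: str) -> dict:
    """Build one audit record per file: name, counts, status, timestamp."""
    source_rows = count_rows(path)
    try:
        target_rows = ingest(path)
        status = "Succeeded" if target_rows == source_rows else "RowCountMismatch"
    except Exception:
        target_rows, status = 0, "Failed"
    return {
        "file_name": path,
        "source_row_count": source_rows,
        "target_row_count": target_rows,
        "status": status,
        "audited_at": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical file list; in the pipeline this comes from a Get Metadata/listing step.
audit_log = [audit_file(f) for f in ["file1.csv", "file2.csv"]]
```

In ADF itself the equivalent would be writing those same fields to an audit table after each data flow run for a file.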


Solution

  • Your approach of using parameters to handle one file at a time in your data flow and iterating over the list of files in your pipeline is a good way to audit each file ingestion by recording the filename, source row count, target row count, and status.

    In terms of performance, this approach may be slower than processing multiple files at once in your data flow, mainly because each per-file data flow activity run has to acquire a Spark cluster, while a single wildcard run pays that cost once. Setting a time-to-live on the Azure Integration Runtime so that consecutive runs reuse a warm cluster reduces that overhead. Beyond that, the impact depends on the size and complexity of your data flow and the number of files you are processing.
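
    If it helps to see the outer loop in code form, here is a minimal sketch that drives one run of a parameterized pipeline per file from the azure-mgmt-datafactory SDK and records the final status; inside ADF the same pattern is a ForEach activity wrapping an Execute Data Flow activity. The resource names, the fileName pipeline parameter, and the file list are assumptions:

```python
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# All of these names are assumptions for the sketch.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
pipeline_name = "IngestSingleFile"  # parameterized pipeline wrapping the data flow
files = ["landing/file1.csv", "landing/file2.csv"]

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

for file_name in files:
    # One pipeline run per file, passing the file name as a pipeline parameter.
    run = client.pipelines.create_run(
        resource_group, factory_name, pipeline_name,
        parameters={"fileName": file_name},
    )

    # Poll until the run finishes so the status can be recorded per file.
    status = "InProgress"
    while status in ("Queued", "InProgress"):
        time.sleep(15)
        status = client.pipeline_runs.get(
            resource_group, factory_name, run.run_id
        ).status

    print(f"{file_name}: {status}")
```

    Polling each run to completion keeps the per-file audit record simple; the trade-off is that files are processed sequentially unless you fan the runs out in parallel.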