Tags: azure-data-factory, etl, staging

Metadata-driven ADF pipeline to ingest data from multiple sources


I am designing an ADF solution to read multiple source folders from the Data Lake raw layer. I need to build a config file where I can dynamically manage which files/folders to load and the date range to load.

The folder structure in the container is as follows:

[Screenshot: folder hierarchy in the container, e.g. crm/src1/2024/03/22/]

As shown in the screenshot above, I need to load the files from the raw layer into the staging layer dynamically, with a config file specifying the paths and file types. We receive multiple file types (CSV, JSON, etc.).

Please share any solution; I am starting a new design.


Solution

  • To achieve this requirement, first generate a file containing the list of the required source file paths.

    To generate that file, use a dataflow. First, create a DelimitedText dataset pointing to your root container with no column delimiter set, and use this dataset as the source of the dataflow.

    You mentioned that your source files follow a directory structure like root(crm)/src1/2024/03/22/. So, set the wildcard file path to match that structure: */*/*/*/*. In the source settings, add a column filepath to store the file path of each row.
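
    If it helps to see this in script form, the data flow script behind such a source might look roughly like the sketch below (the stream name rawSource is a placeholder, and rowUrlColumn is the script-side property I am assuming corresponds to the Column to store file name setting):

    ```
    source(
        allowSchemaDrift: true,
        validateSchema: false,
        ignoreNoFilesFound: false,
        wildcardPaths:['*/*/*/*/*'],
        rowUrlColumn: 'filepath') ~> rawSource
    ```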

    Import the projection and check the data preview. You will get two columns, one of which contains the file paths of all the source files, regardless of file type.

    Next, use an aggregate transformation and group by the filepath column. In the Aggregates section, create a new column count with the expression count(filepath).
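
    In script form, the aggregate step would be along these lines (stream names are placeholders):

    ```
    rawSource aggregate(
        groupBy(filepath),
        count = count(filepath)) ~> distinctPaths
    ```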

    This gives all distinct file paths and their counts. Next, use a select transformation to remove the extra count column from the flow.

    Now, add a derived column transformation on the filepath column with the expression dropLeft(filepath,1). This removes the leading / from every row.
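
    A rough script sketch of the select and derived column steps, again with placeholder stream names:

    ```
    distinctPaths select(mapColumn(filepath),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> pathsOnly

    pathsOnly derive(filepath = dropLeft(filepath, 1)) ~> trimmedPaths
    ```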

    Now, add a filter transformation to keep only the required file paths. You can use dataflow expressions and functions to implement your date-range condition in the filter.
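
    As an illustration only (the exact condition depends on your folder naming), a date-range filter for paths like src1/2024/03/22/file.csv could be built from the path segments, for example to keep only March 2024:

    ```
    trimmedPaths filter(
        between(
            toDate(concat(split(filepath, '/')[2], '-', split(filepath, '/')[3], '-', split(filepath, '/')[4]), 'yyyy-MM-dd'),
            toDate('2024-03-01'),
            toDate('2024-03-31'))) ~> filteredPaths
    ```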

    Next, create another DelimitedText dataset as the sink of the dataflow and point its file path to the temporary container. Make sure the First row as header checkbox is unchecked in this dataset.

    Go to the sink settings of the dataflow and configure the sink to write a single output file named filepaths.csv (set the File name option to Output to single file).
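
    The generated sink script would then look roughly like this; the single-file output is the important part, and the exact property names may differ slightly from what the UI generates:

    ```
    filteredPaths sink(
        allowSchemaDrift: true,
        validateSchema: false,
        partitionFileNames:['filepaths.csv'],
        partitionBy('hash', 1),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> filepathsSink
    ```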

    When the dataflow is executed from the pipeline, a filepaths.csv file containing the required file paths will be generated.

    After the dataflow activity, add a copy activity to the pipeline with two Binary datasets as its source and sink. Point the source Binary dataset to the source container (the root, crm) and the sink Binary dataset to the target root container.

    In the copy activity source, select the List of files option and provide the filepaths.csv file generated above.

    Now execute the pipeline; all the files listed in filepaths.csv will be copied to the target container, maintaining the same folder structure.
