I am designing an ADF solution to read multiple source folders from the Data Lake raw layer. I need to build a config file where I can dynamically manage which files/folders to load and the date range to load.
i.e., the folder structure in the container is as follows:
[screenshot: folder hierarchy, e.g. root(crm)/src1/2024/03/22/]
As shown in the screenshot above, I need to load the files from the raw layer to the staging layer dynamically, with a config file assigning the path and file types. We are receiving multiple types of files (CSV, JSON, etc.).
Please share any solution. I am starting a new design.
To achieve your requirement, first you need to generate a file containing the list of required source file paths.

To generate that file, use a dataflow. First, create a DelimitedText dataset, give the path up to your root container, and set no column delimiter. Use this dataset as the source of the dataflow.
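A minimal sketch of what that source dataset could look like in JSON, assuming an ADLS Gen2 linked service named LS_DataLake and a root container named crm (both names are hypothetical; the empty columnDelimiter here stands for the No delimiter choice in the dataset UI):

```json
{
    "name": "DS_RawRootText",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "LS_DataLake",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "crm"
            },
            "columnDelimiter": "",
            "firstRowAsHeader": false
        }
    }
}
```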
You mentioned that your source files are in a directory structure like root(crm)/src1/2024/03/22/. So, give the wildcard file path as per that structure: */*/*/*/*. In the source settings, add a column named filepath (the Column to store file name option), so that every row carries the path of the file it came from. Import the projection and check the data preview. You will get two columns, one of which contains the file paths of all types of source files.
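For illustration, the data preview would look roughly like this (file names and contents are hypothetical, and the auto-generated name of the first column may differ): the raw line read from each file sits in the first column and the path it came from sits in filepath.

```
_c0                                  filepath
id,name,created_date                 /src1/2024/03/22/customers.csv
1,Contoso,2024-03-22                 /src1/2024/03/22/customers.csv
{"id": 2, "name": "Fabrikam"}        /src2/2024/03/22/accounts.json
```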
Next, use an aggregate transformation and group by the filepath column. In the aggregates section, create a new column count with the expression count(filepath). This gives every distinct file path and its row count; the grouping is what deduplicates the paths, and the count is only a sample aggregate that is not needed downstream. Then use a select transformation to remove the extra count column from the flow.
Now, add a derived column transformation on the filepath column and use the expression dropLeft(filepath,1). This removes the leading / from every row, as in the sketch below.
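For example (the path is hypothetical):

```
dropLeft('/src1/2024/03/22/customers.csv', 1)
```

returns src1/2024/03/22/customers.csv, i.e. (assuming the filepath values begin at the container root) the path relative to the container.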
Now, add a filter transformation and keep only the required file paths. You can use dataflow expressions and functions for your date-range (and file-type) conditions in the filter transformation; if the date range comes from your config file, you can pass those values in as dataflow parameters and reference them in the expression, as in the sketch below.
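A sketch of such a filter expression, assuming the date appears as yyyy/MM/dd folder segments in the path and that startDate and endDate are dataflow parameters of type date supplied by the pipeline (the .csv/.json check is just an example file-type condition):

```
toDate(regexExtract(filepath, '([0-9]{4}/[0-9]{2}/[0-9]{2})', 1), 'yyyy/MM/dd') >= $startDate
&& toDate(regexExtract(filepath, '([0-9]{4}/[0-9]{2}/[0-9]{2})', 1), 'yyyy/MM/dd') <= $endDate
&& (endsWith(filepath, '.csv') || endsWith(filepath, '.json'))
```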
Next, take another DelimitedText dataset as the sink of the dataflow and give the file path up to a temporary container in the dataset. Here, make sure you clear the First row as header checkbox.
Go to the sink settings of the dataflow and set the File name option to Output to single file, with the file name filepaths.csv (this also requires setting the sink to single partition).
Upon executing the dataflow from the pipeline, the filepaths.csv file will be generated with the required file paths.
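A hypothetical example of the generated filepaths.csv content, with one container-relative path per line:

```
src1/2024/03/22/customers.csv
src1/2024/03/22/orders.json
src2/2024/03/22/accounts.csv
```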
After the dataflow activity, add a copy activity to the pipeline with two Binary datasets as its source and sink. Give the path up to the source container (the crm root) in the source Binary dataset and up to the target root container in the sink Binary dataset. In the copy activity source, select the List of files option and point it to the filepaths.csv file generated above.
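For reference, a rough sketch of how that source and sink could look in the copy activity JSON on ADLS Gen2, assuming the list file sits in a container named temp (names are hypothetical):

```json
"source": {
    "type": "BinarySource",
    "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "fileListPath": "temp/filepaths.csv"
    },
    "formatSettings": {
        "type": "BinaryReadSettings"
    }
},
"sink": {
    "type": "BinarySink",
    "storeSettings": {
        "type": "AzureBlobFSWriteSettings"
    }
}
```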
Now execute the pipeline, and all the paths listed in the filepaths.csv file will be copied to the target container, maintaining the same folder structure.
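Since you asked about a config file: one way (only a sketch, with hypothetical names) is to keep a small JSON config in the lake, read it with a Lookup activity, and pass its values into the dataflow parameters used in the filter above:

```json
{
    "sourceContainer": "crm",
    "folders": ["src1", "src2"],
    "fileTypes": [".csv", ".json"],
    "startDate": "2024-03-01",
    "endDate": "2024-03-22"
}
```

In the Execute Data Flow activity, you can then set a parameter such as startDate to an expression like @activity('LookupConfig').output.firstRow.startDate, where LookupConfig is the hypothetical Lookup activity reading this file.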