Tags: csv, foreach, azure-data-factory, file-rename, azure-data-lake-gen2

Copying and renaming files from different folders and the same folder, based on paths from a configuration SQL table


I have an ADLS folder, outbound, which has sub-folders in it, and my partition CSV files are present in multiple sub-folders of the outbound folder.

Please find attached the picture config_table_folder_path for the folder paths of my files, which I'm fetching from my configuration table.

1st pipeline: copies data from SQL tables to the respective ADLS folders provided in the configuration table.

These files are partition files which have to be renamed.

2nd pipeline: copies files from the respective folder paths, renames them, copies them back to the same folder path, and deletes the older files.

Please find attached the pictures of the 2nd pipeline (main pipeline and execute pipeline).

  1. Lookup = fetches the folder paths from my configuration SQL table

  2. ForEach = loops through each path, as there are multiple folder paths, as you can see in the config table picture above

In order to loop through each file, I have an Execute Pipeline activity:

  1. Execute Pipeline = has a parameter folderpath = @item().DEST_FOLDER_PATH

3.1 Get Metadata = gets all the files inside the folder

3.2 ForEach = loops through each file

3.3 Copy activity = renames each file present in the folders

3.4 Delete = since the Copy activity creates a new file after renaming, deletes the older files


Expected output of the CSV file names: after removing the underscore and appending the digits after the underscore with the hhmmss of my filename.

MAIN ISSUE

The above approach works only when each file is in a different folder; it throws an error when there are multiple files in the same folder. I do not have any issue with the expected output. My issue is that my pipeline fails when the folder path is the same for all files, and it throws an error on the 2nd iteration, which is not required. For example, if my folder is outbound/NL/historical and I have 5 files in this historical folder, the ForEach runs 5 times, which I do not want: after the 1st loop itself my files would already be renamed, yet it executes 5 times and throws an error.

I have tried not using Get Metadata, but I need Get Metadata to iterate through multiple files in the same folder.


I want my pipeline to work even when multiple files are present in the same folder, as well as when multiple files are present in multiple folders. (Screenshots attached: copy source, copy sink, delete activity, error.)
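To make the failure mode concrete, here is a hypothetical Python sketch (not ADF code) of one ForEach pass whose copy source uses the wildcard *.csv. A dict stands in for the folder, mapping file names to contents; the target name and contents are made up for illustration.

```python
import fnmatch

def wildcard_iteration(folder: dict, target: str) -> dict:
    """One ForEach pass: wildcard *.csv copy, then delete the old files.
    `folder` models an ADLS folder as {file name: file content}."""
    matches = fnmatch.filter(folder, '*.csv')
    if not matches:
        # Roughly where the 2nd, 3rd, ... iterations fail once the
        # Delete activity has already removed the original files.
        raise FileNotFoundError('no source files match *.csv')
    for name in matches:
        # Every matching file lands in the SAME renamed target, so the
        # last file's content silently overwrites the earlier ones.
        folder[target] = folder[name]
    for name in matches:
        del folder[name]  # Delete activity removes the old files
    return folder

folder = {'out_1.csv': 'first', 'out_2.csv': 'second'}
wildcard_iteration(folder, 'renamed.csv')
print(folder)  # {'renamed.csv': 'second'}  <- out_1's content is lost
```

A single pass already renames (and deletes) everything in the folder, losing all but the last file's content; repeated passes eventually leave the folder empty and then raise, which mirrors the pipeline erroring on later iterations.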


Solution

  • Follow the below steps in the child pipeline to achieve your requirement.

    In the child pipeline's Get Metadata activity, take another dataset parameterized only up to the folder path. Provide the folderpath pipeline parameter to it so that in every child pipeline run, the folder path changes as per the parent pipeline's iteration.


    Give the childItems array to the ForEach loop. In the copy activity source inside the ForEach, you are using the wildcard file path *.csv, which copies all files ending with .csv to the target (the renamed file).

    That means, for example, if the source folder contains file1.csv and file2.csv, the copy takes all the csv files and copies them one by one into the same target location in a single iteration. As a result, the generated target (renamed) file's content after each iteration is only file2.csv's content, because each copy overwrites the target with the last source file's content.

    So, you should not use a wildcard file path like that. Instead, use dataset parameters for the container, folder path, and file path, and provide the necessary values for them in the copy activity source as below. Give the current file @item().name for the file path.


    In the sink, do the same, but give your file-rename expression for the file path. Here, as a sample, I have given @replace(item().name,'_','') to rename the files.


    Next, to delete the current file, give the same values as the copy activity source in the Delete activity as well.


    As a new file is created in every iteration, this will delete the old file.

    Now debug the parent pipeline and it will give the expected results.

    My source files before pipeline run:


    New files after pipeline run:


    If your pipeline is slow, you can uncheck the Sequential checkbox in the ForEach loops so that the iterations complete in parallel.
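As a sanity check, the fixed child-pipeline logic above can be modeled outside ADF. This is only an illustrative Python sketch (not ADF code), using the answer's sample rename expression @replace(item().name,'_','') and a dict standing in for one ADLS folder:

```python
def run_child_pipeline(folder: dict) -> None:
    """Model of: Get Metadata -> ForEach -> Copy (rename) -> Delete.
    `folder` maps file names to contents, like one ADLS folder."""
    # Get Metadata: snapshot childItems BEFORE the loop starts, so files
    # renamed during the loop are never re-processed.
    child_items = list(folder.keys())

    for name in child_items:              # ForEach over childItems
        new_name = name.replace('_', '')  # sink file path: @replace(item().name,'_','')
        folder[new_name] = folder[name]   # Copy: write renamed file to the same folder
        if new_name != name:
            del folder[name]              # Delete: remove only the current old file

folder = {'file_1.csv': 'a', 'file_2.csv': 'b', 'file_3.csv': 'c'}
run_child_pipeline(folder)
print(sorted(folder))  # ['file1.csv', 'file2.csv', 'file3.csv']
```

Because each iteration copies and deletes exactly one file (rather than a wildcard match), running once per file in the same folder renames everything without the later iterations failing.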