Search code examples
azure-data-factorywildcard

Is it possible to use ^ operator in wildcards for Azure DataFactory delete activity?


In Azure DataFactory, I want to delete all files at a particular path that do not contain a substring (i.e. '2024-04-15').

I managed to do it with a loop activity but it is too expensive when the amount of files at the particular path is big (around 1000).

Thus I wanted to test wildcards, but I am stuck trying to use the not operator for regexes.

For example, if I have 3 files at the path: [lorem_2024-04-15.csv, lorem_2024-04-14_test.csv, lorem_2024-04-14.csv], I want to delete all files but lorem_2024-04-15.csv. I tried:

Wildcard file name: *^(?!.*2024-04-15).*

But when trying to preview data for the dataset pointing to that path:

The escaped character ^( is not recognized for wildcard filter (*, ?). 
If you want to use ^ in file name, specify ^^ to escape. Index: 2.

enter image description here


Solution

  • I managed to do it with a loop activity, but it is too expensive when the amount of files at the particular path is large (around 1000).

    If you don't want to use a loop to delete the required files, you can follow the procedure below:

    List all the required files that need to be deleted in one file using a data flow as follows:

    Add the required dataset as the source of the data flow with a wildcard path and add a column filename in the source settings, as shown below:

    enter image description here

    You will get the file path as shown below:

    enter image description here

    Next, use an aggregate transformation and use groupBy on the filename column. In the aggregate section, create a new column filenamecount and take a count of the filename (count(filename)) column for a sample.

    enter image description here

    It will give all distinct file paths and their count. Next, use a select transformation and remove the extra filenamecount column from the flow.

    enter image description here

    Now, use a derivedColumn transformation to the filename column and use the expression dropLeft(filename,1) in it. This will remove the starting / in every row.

    enter image description here

    Now, use a filter transformation and use the expression not(like(filename, "%2024-04-15%")) to exclude filenames that contain 2024-04-15.

    enter image description here

    Next, take another DelimitedText dataset as the sink of the data flow and give the file path. Here, make sure you remove the First row as Header checkbox.

    enter image description here

    Configure sink settings as shown below:

    enter image description here

    After executing the data flow, the file will generate a filename.csv with the required file paths as shown below:

    enter image description here

    After the data flow activity, add a delete activity, select the dataset, and opt for the List of files and browse the filename.csv file.

    enter image description here

    After the execution of the delete activity, it will delete all files except those containing 2024-04-15 as shown below:

    enter image description here