In Azure DataFactory, I want to delete all files at a particular path that do not contain a substring (i.e. '2024-04-15').
I managed to do it with a loop activity but it is too expensive when the amount of files at the particular path is big (around 1000).
Thus I wanted to test wildcards, but I am stuck trying to use the not operator for regexes.
For example, if I have 3 files at the path: [lorem_2024-04-15.csv, lorem_2024-04-14_test.csv, lorem_2024-04-14.csv]
, I want to delete all files but lorem_2024-04-15.csv
. I tried:
Wildcard file name: *^(?!.*2024-04-15).*
But when trying to preview data for the dataset pointing to that path:
The escaped character ^( is not recognized for wildcard filter (*, ?).
If you want to use ^ in file name, specify ^^ to escape. Index: 2.
I managed to do it with a loop activity, but it is too expensive when the amount of files at the particular path is large (around 1000).
If you don't want to use a loop to delete the required files, you can follow the procedure below:
List all the required files that need to be deleted in one file using a data flow as follows:
Add the required dataset as the source of the data flow with a wildcard path and add a column filename
in the source settings, as shown below:
You will get the file path as shown below:
Next, use an aggregate transformation and use groupBy on the filename
column. In the aggregate section, create a new column filenamecount
and take a count of the filename (count(filename)
) column for a sample.
It will give all distinct file paths and their count. Next, use a select transformation and remove the extra filenamecount
column from the flow.
Now, use a derivedColumn transformation to the filename
column and use the expression dropLeft(filename,1)
in it. This will remove the starting /
in every row.
Now, use a filter transformation and use the expression not(like(filename, "%2024-04-15%"))
to exclude filenames that contain 2024-04-15
.
Next, take another DelimitedText
dataset as the sink of the data flow and give the file path. Here, make sure you remove the First row as Header checkbox.
Configure sink settings as shown below:
After executing the data flow, the file will generate a filename.csv with the required file paths as shown below:
After the data flow activity, add a delete activity, select the dataset, and opt for the List of files and browse the filename.csv file.
After the execution of the delete activity, it will delete all files except those containing 2024-04-15 as shown below: