i have a GetMetaData function in my pipeline which gives me the childitems of the files inside a datalake 2. There are over 100 files in there. I only need a list of the top 30 (filtered by the name) files. How can i achieve that? I have no clue. I am pretty new to azure and still learnin.
The filename looks like this:
"productx_20220301T000000.csv" "productx_20220302T000000.csv" "productx_20220303T000000.csv" ... "productx_20230329T000000.csv"
I want the top 30 files after name (date). e.g. "productx_20230329T000000.csv" "productx_20230328T000000.csv" "productx_20230327T000000.csv" ...
If you have an idea i would appreciate a short explanation in the best case with a picture :)
thanks everyone for the help.
If you have files on every day(as per date in the file name) without missing any days in between the last 30 days, then you can try the below approach.
First I have generated last n
dates array in yyyyMMdd
format using until activity. Then I have used filter activity on Get Meta data child items array. In this filter, it checks whether the dates array
contains the extracted date from file name of child item or not.
These are my sample input files:
Here, I have filtered for latest 4 days i.e, 20230330
,20230329
,20230328
,20230327
files.
This is my pipeline flow:
dates array in yyyyMMdd
format for 4 days:
This is my pipeline JSON for your reference:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Get Metadata1",
"type": "GetMetadata",
"dependsOn": [
{
"activity": "Generates dates array until n days",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"dataset": {
"referenceName": "source_csvfiles",
"type": "DatasetReference"
},
"fieldList": [
"childItems"
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
}
},
{
"name": "intialize n-1 for n days",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "daysiterator",
"value": {
"value": "3",
"type": "Expression"
}
}
},
{
"name": "Generates dates array until n days",
"type": "Until",
"dependsOn": [
{
"activity": "intialize n-1 for n days",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"expression": {
"value": "@equals(variables('daysiterator'), '-1')",
"type": "Expression"
},
"activities": [
{
"name": "Append to days array",
"type": "AppendVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "lastndays",
"value": {
"value": "@formatDateTime(subtractFromTime(utcnow(), int(variables('daysiterator')), 'Day'),'yyyyMMdd')",
"type": "Expression"
}
}
},
{
"name": "decrement and store in temp",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Append to days array",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "temp",
"value": {
"value": "@string(add(int(variables('daysiterator')), -1))",
"type": "Expression"
}
}
},
{
"name": "reassign temp to daysterator",
"type": "SetVariable",
"dependsOn": [
{
"activity": "decrement and store in temp",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "daysiterator",
"value": {
"value": "@variables('temp')",
"type": "Expression"
}
}
}
],
"timeout": "0.12:00:00"
}
},
{
"name": "Filter1",
"type": "Filter",
"dependsOn": [
{
"activity": "Get Metadata1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@activity('Get Metadata1').output.childItems",
"type": "Expression"
},
"condition": {
"value": "@contains(variables('lastndays'),substring(item().name, add(indexof(item().name,'_'),1), 8))",
"type": "Expression"
}
}
}
],
"variables": {
"lastndays": {
"type": "Array"
},
"daysiterator": {
"type": "String"
},
"temp": {
"type": "String"
}
},
"annotations": []
}
}
Filter activity result array:
If your files missing on some days, then try the below approach.
int(substring(item().name, add(indexof(item().name,'_'),1), 8))
).@contains(take(variables('Sorted dates array'),30),int(substring(item().name, add(indexof(item().name,'_'),1), 8)))
.