Get all folder names in subfolders in Azure Data Factory


I have the following folder structure in Data Lake:

datasetname/fullload/year/month/day/hour/min/sec/data

I can't create an Azure Function or use Databricks. I can only use simple ADF activities.

I want to get the latest folder name at each subfolder level under my parent folder (datasetname/fullload). I tried Get Metadata -> Set Variable inside a loop, but it is still not working.

In other words, I need the path to the latest folder in the blob storage.

Thanks


Solution

    • Because the folder names are numeric (year, month, day, and so on), the latest data sits under the folder with the greatest number at each level. You can descend level by level, picking the maximum each time.
    • My sample folder structure is shown in the image below:

    enter image description here

    • To find the greatest folder at each level, I used 2 pipelines. pipeline1 loops with an Until activity: it reads the child items of the current path, appends the greatest one to the path, and stops when the current folder has no child items. pipeline2 takes the list of sub-folder names in a particular folder and returns the maximum number.

    • The following is the pipeline JSON for pipeline1:

    {
        "name": "pipeline1",
        "properties": {
            "activities": [
                {
                    "name": "get path",
                    "type": "Until",
                    "dependsOn": [
                        {
                            "activity": "Set flag",
                            "dependencyConditions": [
                                "Succeeded"
                            ]
                        }
                    ],
                    "userProperties": [],
                    "typeProperties": {
                        "expression": {
                            "value": "@equals(variables('flag'),'true')",
                            "type": "Expression"
                        },
                        "activities": [
                            {
                                "name": "Get sub folders",
                                "type": "GetMetadata",
                                "dependsOn": [],
                                "policy": {
                                    "timeout": "0.12:00:00",
                                    "retry": 0,
                                    "retryIntervalInSeconds": 30,
                                    "secureOutput": false,
                                    "secureInput": false
                                },
                                "userProperties": [],
                                "typeProperties": {
                                    "dataset": {
                                        "referenceName": "root",
                                        "type": "DatasetReference",
                                        "parameters": {
                                            "path": {
                                                "value": "@variables('path')",
                                                "type": "Expression"
                                            }
                                        }
                                    },
                                    "fieldList": [
                                        "childItems"
                                    ],
                                    "storeSettings": {
                                        "type": "AzureBlobFSReadSettings",
                                        "enablePartitionDiscovery": false
                                    },
                                    "formatSettings": {
                                        "type": "DelimitedTextReadSettings"
                                    }
                                }
                            },
                            {
                                "name": "If Condition1",
                                "type": "IfCondition",
                                "dependsOn": [
                                    {
                                        "activity": "Get sub folders",
                                        "dependencyConditions": [
                                            "Succeeded"
                                        ]
                                    }
                                ],
                                "userProperties": [],
                                "typeProperties": {
                                    "expression": {
                                        "value": "@greater(length(activity('Get sub folders').output.childItems),0)",
                                        "type": "Expression"
                                    },
                                    "ifFalseActivities": [
                                        {
                                            "name": "Set variable1",
                                            "type": "SetVariable",
                                            "dependsOn": [],
                                            "policy": {
                                                "timeout": "0.12:00:00",
                                                "retry": 0,
                                                "retryIntervalInSeconds": 30,
                                                "secureOutput": false,
                                                "secureInput": false
                                            },
                                            "userProperties": [],
                                            "typeProperties": {
                                                "variableName": "flag",
                                                "value": {
                                                    "value": "true",
                                                    "type": "Expression"
                                                }
                                            }
                                        }
                                    ],
                                    "ifTrueActivities": [
                                        {
                                            "name": "get latest",
                                            "type": "ExecutePipeline",
                                            "dependsOn": [],
                                            "userProperties": [],
                                            "typeProperties": {
                                                "pipeline": {
                                                    "referenceName": "pipeline2",
                                                    "type": "PipelineReference"
                                                },
                                                "waitOnCompletion": true,
                                                "parameters": {
                                                    "array_to_find_max": {
                                                        "value": "@activity('Get sub folders').output.childItems",
                                                        "type": "Expression"
                                                    }
                                                }
                                            }
                                        },
                                        {
                                            "name": "append max to path",
                                            "type": "SetVariable",
                                            "dependsOn": [
                                                {
                                                    "activity": "get latest",
                                                    "dependencyConditions": [
                                                        "Succeeded"
                                                    ]
                                                }
                                            ],
                                            "policy": {
                                                "timeout": "0.12:00:00",
                                                "retry": 0,
                                                "retryIntervalInSeconds": 30,
                                                "secureOutput": false,
                                                "secureInput": false
                                            },
                                            "userProperties": [],
                                            "typeProperties": {
                                                "variableName": "tp",
                                                "value": {
                                                    "value": "@{variables('path')}/@{activity('get latest').output.pipelineReturnValue.max_val}",
                                                    "type": "Expression"
                                                }
                                            }
                                        },
                                        {
                                            "name": "update path",
                                            "type": "SetVariable",
                                            "dependsOn": [
                                                {
                                                    "activity": "append max to path",
                                                    "dependencyConditions": [
                                                        "Succeeded"
                                                    ]
                                                }
                                            ],
                                            "policy": {
                                                "timeout": "0.12:00:00",
                                                "retry": 0,
                                                "retryIntervalInSeconds": 30,
                                                "secureOutput": false,
                                                "secureInput": false
                                            },
                                            "userProperties": [],
                                            "typeProperties": {
                                                "variableName": "path",
                                                "value": {
                                                    "value": "@variables('tp')",
                                                    "type": "Expression"
                                                }
                                            }
                                        }
                                    ]
                                }
                            }
                        ],
                        "timeout": "0.12:00:00"
                    }
                },
                {
                    "name": "Set flag",
                    "type": "SetVariable",
                    "dependsOn": [
                        {
                            "activity": "Set path",
                            "dependencyConditions": [
                                "Succeeded"
                            ]
                        }
                    ],
                    "policy": {
                        "timeout": "0.12:00:00",
                        "retry": 0,
                        "retryIntervalInSeconds": 30,
                        "secureOutput": false,
                        "secureInput": false
                    },
                    "userProperties": [],
                    "typeProperties": {
                        "variableName": "flag",
                        "value": {
                            "value": "false",
                            "type": "Expression"
                        }
                    }
                },
                {
                    "name": "Set path",
                    "type": "SetVariable",
                    "dependsOn": [],
                    "policy": {
                        "timeout": "0.12:00:00",
                        "retry": 0,
                        "retryIntervalInSeconds": 30,
                        "secureOutput": false,
                        "secureInput": false
                    },
                    "userProperties": [],
                    "typeProperties": {
                        "variableName": "path",
                        "value": {
                            "value": "data/f1/ff1",
                            "type": "Expression"
                        }
                    }
                }
            ],
            "variables": {
                "path": {
                    "type": "String"
                },
                "flag": {
                    "type": "String"
                },
                "values": {
                    "type": "Array"
                },
                "max_val": {
                    "type": "String"
                },
                "tp": {
                    "type": "String"
                }
            },
            "annotations": []
        }
    }
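
    • To make the control flow of pipeline1 easier to follow, here is a rough Python sketch of the same loop. It is only an illustration of the logic, not something to deploy: the FOLDERS dictionary, list_children, and the pick_latest callback are hypothetical stand-ins for the data lake, the Get Metadata activity, and pipeline2. The intermediate tp variable exists because a Set Variable activity cannot reference the variable it is setting.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for the data lake: folder path -> child folder names.
FOLDERS: Dict[str, List[str]] = {
    "data/f1/ff1": ["2022", "2023"],
    "data/f1/ff1/2022": ["12"],
    "data/f1/ff1/2023": ["01", "02"],
    "data/f1/ff1/2023/01": ["30"],
    "data/f1/ff1/2023/02": ["05", "17"],
}


def list_children(path: str) -> List[str]:
    """Stand-in for the 'Get sub folders' Get Metadata activity (childItems)."""
    return FOLDERS.get(path, [])


def find_latest_path(root: str, pick_latest: Callable[[List[str]], str]) -> str:
    path = root            # 'Set path'
    flag = "false"         # 'Set flag'
    while flag != "true":  # Until: @equals(variables('flag'),'true')
        children = list_children(path)
        if len(children) > 0:                # 'If Condition1'
            max_val = pick_latest(children)  # 'get latest' (calls pipeline2)
            tp = f"{path}/{max_val}"         # 'append max to path'
            path = tp                        # 'update path'
        else:
            flag = "true"                    # 'Set variable1'
    return path


# Stand-in for pipeline2: pick the name with the greatest numeric value.
latest = find_latest_path("data/f1/ff1", lambda names: max(names, key=int))
print(latest)
```

    With this sample tree, the loop walks 2023, then 02, then 17, and stops when the last folder has no children.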
    
    • The following is the pipeline JSON for pipeline2:
    {
        "name": "pipeline2",
        "properties": {
            "activities": [
                {
                    "name": "make array of values",
                    "type": "ForEach",
                    "dependsOn": [],
                    "userProperties": [],
                    "typeProperties": {
                        "items": {
                            "value": "@pipeline().parameters.array_to_find_max",
                            "type": "Expression"
                        },
                        "isSequential": true,
                        "activities": [
                            {
                                "name": "append each value",
                                "type": "AppendVariable",
                                "dependsOn": [],
                                "userProperties": [],
                                "typeProperties": {
                                    "variableName": "values",
                                    "value": {
                                        "value": "@int(item().name)",
                                        "type": "Expression"
                                    }
                                }
                            }
                        ]
                    }
                },
                {
                    "name": "return max",
                    "type": "SetVariable",
                    "dependsOn": [
                        {
                            "activity": "make array of values",
                            "dependencyConditions": [
                                "Succeeded"
                            ]
                        }
                    ],
                    "policy": {
                        "timeout": "0.12:00:00",
                        "retry": 0,
                        "retryIntervalInSeconds": 30,
                        "secureOutput": false,
                        "secureInput": false
                    },
                    "userProperties": [],
                    "typeProperties": {
                        "variableName": "pipelineReturnValue",
                        "value": [
                            {
                                "key": "max_val",
                                "value": {
                                    "type": "Expression",
                                    "content": "@if(equals(length(string(max(variables('values')))),1),concat('0',string(max(variables('values')))),string(max(variables('values'))))"
                                }
                            }
                        ],
                        "setSystemVariable": true
                    }
                }
            ],
            "parameters": {
                "array_to_find_max": {
                    "type": "array",
                    "defaultValue": [
                        {
                            "name": "2022",
                            "type": "Folder"
                        },
                        {
                            "name": "2023",
                            "type": "Folder"
                        }
                    ]
                }
            },
            "variables": {
                "values": {
                    "type": "Array"
                },
                "max_val": {
                    "type": "String"
                },
                "tp": {
                    "type": "String"
                }
            },
            "annotations": []
        }
    }
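
    • The expression in the return max activity is dense, so here is a small Python sketch of what pipeline2 computes. The function name max_val and the sample lists are only illustrative. Note two assumptions that the expression bakes in: every child name must be numeric (in Python, int() raises on a name such as data, and the ADF int() conversion would be expected to fail in the same situation), and padding only turns a one-digit result into two digits, which suits month, day, hour, minute, and second folders.

```python
from typing import Dict, List


def max_val(child_items: List[Dict[str, str]]) -> str:
    """Mirror of pipeline2: convert names to int, take the max, pad to 2 digits."""
    # 'make array of values': @int(item().name) appended for every child item.
    values = [int(item["name"]) for item in child_items]
    largest = str(max(values))
    # 'return max': a one-character result gets a leading zero, so 5 becomes '05'.
    return "0" + largest if len(largest) == 1 else largest


years = [{"name": "2022", "type": "Folder"}, {"name": "2023", "type": "Folder"}]
days = [{"name": "05", "type": "Folder"}, {"name": "09", "type": "Folder"}]
print(max_val(years))
print(max_val(days))
```

    The leading zero is restored because converting 09 to an integer drops it, and the folder on disk is still named 09.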
    
    • The following is the dataset configuration that I used for the Get Metadata activity. The dataset takes a path parameter. In my case the initial value of path is data/f1/ff1, and on each iteration the greatest folder name is appended to it:

    enter image description here

    • When I run this pipeline, I get the desired result. After the Until loop stops, the variable path holds the required value, that is, the path to the latest data:

    enter image description here