Tags: azure-data-factory, azure-blob-storage, last-modified

Get last modified of nested file


I am trying to get the last modified date of files in nested folders and only copy the files that have been modified in the last 10 days.

Folder structure:

container: home

product_data/
|---------20221020/
|---------|---------20221020/
|---------|---------|---------product_data_20221020_20221020.parquet
|---------20221021/
|---------|---------20221021/
|---------|---------|---------product_data_20221021_20221021.parquet
|---------20231102/
|---------|---------20231102/
|---------|---------|---------product_data_20231102_20231102.parquet

Only the 20231102 parquet file should be copied, because it was last modified on Nov-7 (the last modified date does not match the date in the file name).

I've messed with a similar issue before: Get Last Modified Date on Partitioned Data Using Azure Data Factory

My current issue is that I can't filter the files at all.

Image 1: Pipeline Overview

Image 2: Get Metadata Config

Image 3: Parent Dataset (root folder)

Image 4: For Loop

Image 5: Get Metadata inside the for loop

Image 6: Dataset for Get Metadata inside the for loop

Image 7: Parameters for Dataset for Get Metadata inside the for loop

Image 8: Get Metadata output inside the for loop

Because the "Filter by last modified" option on the Get Metadata activity inside the for loop doesn't seem to work, I also tried adding a Filter activity and tried setting a variable (to debug), but both fail.

Filter Config

items: @activity('Get Files Metadata').output.itemName

Condition: @greater(activity('Get Files Metadata').output.lastModified, addMinutes(utcNow(), -30))

Image 9: Filter Output (ignore the items count)

Image 10: Filter Error

Image 11: Set Variable Config

Image 12: Set Variable Error


Solution

  • To get the last modified date of files in nested folders, use Get Metadata activities together with a ForEach activity and a parameterized dataset.

    • First, add a Get Metadata activity that lists the subfolders of the product_data folder (Child items in the field list), using a dataset that points at product_data.
    • Pass the output of this activity to a ForEach activity to iterate over each subfolder.
    • Inside the ForEach, add another Get Metadata activity that lists the files in the current subfolder, and set its Filter by last modified option with Start time @getPastTime(10,'Day') and End time @utcNow(). Its dataset takes the subfolder name as a parameter.

    This gives you an array of the files in that folder that were modified in the last 10 days.
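
To sanity-check which files a 10-day window should return, the same selection rule can be sketched outside Data Factory. This is only an illustration in Python, not part of the pipeline: `BlobInfo` and `modified_within` are hypothetical names, the timestamps are made-up sample values, and the half-open window (start inclusive, end exclusive) is an assumption about how the last-modified filter treats its boundaries.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass
class BlobInfo:
    """Hypothetical stand-in for one listed file: full path plus last-modified time (UTC)."""
    name: str
    last_modified: datetime


def modified_within(blobs: Iterable[BlobInfo], days: int = 10,
                    now: Optional[datetime] = None) -> List[str]:
    """Return the names of blobs whose last_modified lies in [now - days, now)."""
    end = now if now is not None else datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    return [b.name for b in blobs if start <= b.last_modified < end]


# Sample data only: these timestamps are assumed for illustration.
now = datetime(2023, 11, 10, tzinfo=timezone.utc)
blobs = [
    BlobInfo("product_data/20221020/20221020/product_data_20221020_20221020.parquet",
             datetime(2022, 10, 20, tzinfo=timezone.utc)),
    BlobInfo("product_data/20221021/20221021/product_data_20221021_20221021.parquet",
             datetime(2022, 10, 21, tzinfo=timezone.utc)),
    BlobInfo("product_data/20231102/20231102/product_data_20231102_20231102.parquet",
             datetime(2023, 11, 7, tzinfo=timezone.utc)),
]

# The rule looks only at last_modified, never at the date embedded in the file name.
recent = modified_within(blobs, days=10, now=now)
print(recent)
```

With the sample timestamps above, only the 20231102 file falls inside the window, which is the result the pipeline is expected to produce.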