Search code examples
sqlazure-data-factory

How can I sort incoming data based on date column to select only latest date data for IDs?


ADF/Data Flow-- I want to sort my incoming data based on the date column. My requirement is to take only latest date data(previous date data) for the ID, and ID can be duplicate with different dates. I have tries sorting based on both id and date column but it is not working for all the records. Can someone please help me to get this sorted in Data flow of Azure data factory!

I need to pick only those IDs which are having latest date from a Date column.Also how can I extract only day from date type column? example '23-05-2023' I want on 23 from this date column.


Solution

  • To pick only the latest date data for each ID, you can use the Aggregate transformation and group by ID column and use the Max function to get the maximum date value for each group. To extract only the day from a date type column, you can use the Derived Column transformation and use the dayOfMonth() function to extract the day value from the date column. Below are the steps.

    • Source transformation is taken with sample input.

    enter image description here

    • Then Aggregate Transformation is taken and aggregate settings is given as,
    group by: id
    aggregates: date = max(date) 
    

    enter image description here

    This groups the data by the ID column, and gets the maximum Date value for each group.

    • Then derived column transformation is taken and create a new column called for Day and give the expression as split(toString(date),'-')[3] or dayOfMonth(date) to extract the day value from date column enter image description here

    Derived column Transformation -Output:

    enter image description here

    Dataflow expression

    source(output(
    id as short,
    date as date
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    ignoreNoFilesFound: false,
    dateFormats: ['dd-MM-yyyy']) ~> source1
    source1 aggregate(groupBy(id),
    date =  max(date)) ~> aggregate1
    aggregate1 derive(Day =  split(toString(date),'-')[3],
    column1 =  dayOfMonth(date)) ~> derivedColumn1