Search code examples
azureazure-data-factoryadfs

How to do two step sorting based on condition in dataflow ADF


I am working on ADF data flow and I am trying to implement a logic for two step sorting in my source data. The requirement is like, I have 1 primary key and 2 date columns : example- id, date1 and date2 Requirement- (i) if there is duplicate data with same id in source file then the row which has maximum date1 should be picked else, (ii) if there is duplicate data with same id and if the date1 is also same for them then the row which has maximum date2 should be picked and sent to output.

I tried giving two sorting one after another in Aggregate stage but currently I am getting random values from both duplicate rows which is wrong.

Can anyone help me to get this requirement? Thank you


Solution

  • In order to get the maximum date1 for the same ids and maximum date2 for same id,date1 combination, you have to first get the max date2 value for the same id and date1. Then check for the maximum date1 for id column. Below is the detailed approach.

    1. Use a Source transformation to read the data from your source file.

    Sample input

    id date1 date2
    1 2023-01-01 2023-01-03
    1 2023-01-02 2023-01-02
    2 2023-01-02 2023-01-01
    2 2023-01-02 2023-01-02

    The sample input that is taken has three columns: id, date1, and date2.

    1. Use an Aggregate transformation to group the data by the id and date1 columns and calculate the maximum value of date2 for each group. This will ensure that for each id and date1 combination, you get the maximum value of date2. You can use the following expression in the Aggregate transformation:

      groupBy(id, date1),
      date2 = max(date2) 
      

    gif1

    The output of this transformation will have three columns: id, date1, and date2 (max value of date2).

    id date1 date2
    1 2023-01-01 2023-01-03
    1 2023-01-02 2023-01-02
    2 2023-01-02 2023-01-02
    1. Use another Aggregate transformation to group the data by the id column and calculate the maximum values of date1 for each id group. This will ensure that for each id, you get the maximum values of date1. You can use the following expression in the Aggregate transformation:

      groupBy(id),
      date1 = max(date1)
      

    gif1

    id date1
    1 2023-01-02
    2 2023-01-02
    • Then use the Join transformation to join the output of the two Aggregate transformations based on the id and date1 columns.

    gif1

    • Use the select transformation to select the id, date1, and date2 columns from the output of the Join transformation and remove duplicate fields.
    id date1 date2
    1 2023-01-02 2023-01-02
    2 2023-01-02 2023-01-02

    This will ensure that you get the rows that satisfy your two-step sorting logic.