Search code examples
sqlsnowflake-cloud-data-platformwindow-functionsranking-functions

How to merge rows startdate enddate based on column values using Lag Lead or window functions?


I have a table with 4 columns: ID, STARTDATE, ENDDATE and BADGE. I want to merge rows based on ID and BADGE values but make sure that only consecutive rows will get merged.

For example, If input is:

enter image description here

Output will be:

enter image description here

I have tried lag lead, unbounded, bounded precedings but unable to achieve the output:

SELECT ID, 
       STARTDATE, 
       MAX(ENDDATE), 
       NAME 
FROM (SELECT USERID, 
             IFF(LAG(NAME) over(Partition by USERID Order by STARTDATE) = NAME, 
                 LAG(STARTDATE) over(Partition by USERID Order by STARTDATE), 
                 STARTDATE) AS STARTDATE, 
             ENDDATE, 
             NAME 
      from myTable ) 
GROUP BY USERID, 
         STARTDATE, 
         NAME

We have to make sure that we merge only consective rows having same ID and Badge.

Help will be appreciated, Thanks.


Solution

  • You can split the problem into two steps:

    • creating the right partitions
    • aggregating on the partitions with direct aggregation functions (MIN and MAX)

    You can approach the first step using a boolean field that is 1 when there's no consecutive date match (row1.ENDDATE = row2.STARTDATE + 1 day). This value will indicate when a new partition should be created. Hence if you compute a running sum, you should have your correctly numbered partitions.

    WITH cte AS (
        SELECT *,
               IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL 1 DAY = STARTDATE , 0, 1) AS boolval
        FROM tab 
    )
    SELECT *
           SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
    FROM cte
    

    Then the second step can be summarized in the direct aggregation of "STARTDATE" and "ENDDATE" using the MIN and MAX function respectively, grouping on your ranking value. For syntax correctness, you need to add "ID" and "Badge" too in the GROUP BY clause, even though their range of action is already captured by the computed ranking value.

    WITH cte AS (
        SELECT *,
               IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL 1 DAY = STARTDATE , 0, 1) AS boolval
        FROM tab 
    ), cte2 AS (
        SELECT *,
               SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
        FROM cte
    )
    SELECT ID,
           MIN(STARTDATE) AS STARTDATE,
           MAX(ENDDATE) AS ENDDATE,
           Badge
    FROM cte2
    GROUP BY ID,
             Badge,
             rn