Search code examples
sqldatabricks

Grouping and collapsing consecutive periods


The business requirement is to collapse consecutive periods into 1 row using SQL in Databricks. If a row's end date is one day before the next rows start date, then these 2 rows are considered to be the same period.

Here is a demo data set:

row id start_date end_date
1 a 2024-05-03 2024-05-06
2 a 2024-05-07 2024-05-10
3 a 2024-05-13 2024-05-17
4 a 2024-05-18 2024-05-22
5 a 2024-05-23 2024-06-27

In this data set, row 1 and 2 should be of the same period, and so do row 3 to 5.

The desired output is:

row id start_date end_date
1 a 2024-05-03 2024-05-10
2 a 2024-05-13 2024-06-27

I tried to use the lead and lag window functions to get previous rows' end date but this approach doesn't work if you have more than 2 rows of consecutive periods.

Thanks very much for your help


Solution

  • This is typically known as Gaps and Island problem. You need to create a dummy column to identify your continuous date range and then find the min and max values based on that column -

    SELECT id, MIN(start_date) start_date, MAX(end_date) end_date
      FROM (SELECT id, start_date, end_date, SUM(rn) OVER(PARTITION BY id ORDER BY start_date) sum_flag
              FROM (SELECT id, start_date, end_date,
                           CASE WHEN start_date - 1 <> LAG(end_date) OVER(PARTITION BY id ORDER BY start_date)
                                     THEN 1
                                ELSE 0
                           END rn
                     FROM demo
                   ) T
           ) X
     GROUP BY id, sum_flag;
    

    Demo.