Search code examples
sqlmysqldateaggregateoverlap

Aggregate rows only if dates range overlap


I am trying to build a complex sql query for hours but still didn't find any way to do it as expected.

Here is my table and my dataset :

create table Skills
(
ID varchar(10),
StartDate date,
EndDate date,
Skill varchar(10)
);

Insert into Skills values
('1','2021-01-01','2021-12-31','A'),
('1','2022-01-01','2022-12-31','B'),
('2','2021-01-01','2021-12-31','A'),
('2','2021-11-30','2022-12-31','B'),
('3','2021-01-01','2021-12-31','A'),
('3','2021-11-30','2022-12-31','B'),
('3','2022-11-30','2023-12-31','C'),
('4','2021-01-01','2021-12-31','A'),
('4','2022-01-01','2022-12-31','B'),
('4','2022-11-30','2023-12-31','C');

I would like to aggregate rows by ID only when dates range (StartDate, EndDate) overlap. Here is the expected result :

1, 2021-01-01, 2021-12-31, A
1, 2022-01-01, 2022-12-31, B
2, 2021-01-01, 2022-12-31, B
3, 2021-01-01, 2023-12-31, C
4, 2021-01-01, 2021-12-31, A
4, 2022-01-01, 2023-12-31, C

When rows with overlapping dates range are aggregated, we need to keep the oldest StartDate, the newest EndDate and the Skill associated to the newest EndDate.

I tried so many queries with partition by, lag, cte, etc.

Could you help me find the right solution please ?

Thanks, Regards


Solution

  • This is a gaps and islands problem, to solve it you can use lag() to determine where the "islands" start, Then use a cumulative sum() to determine gaps :

    Assuming the endDate is unique per id :

    select d.*, s.skill
    from (
      select d.id, min(d.start_date) as start_date, max(d.end_date) as end_date
      from (
          select d.*,
          sum(case when DATEDIFF(prev_end_date, start_date) > 0 then 0 else 1 end)  over (partition by id order by start_date) as grp
          from (
                select d.*,
                lag(end_date) over (partition by id order by start_date) as prev_end_date
                from Skills d
          ) d
      ) d
      group by d.id, grp
    ) d
    inner join Skills s on s.id = d.id and s.end_date = d.end_date
    

    Demo here