I am trying to build a complex sql query for hours but still didn't find any way to do it as expected.
Here is my table and my dataset :
create table Skills
(
ID varchar(10),
StartDate date,
EndDate date,
Skill varchar(10)
);
Insert into Skills values
('1','2021-01-01','2021-12-31','A'),
('1','2022-01-01','2022-12-31','B'),
('2','2021-01-01','2021-12-31','A'),
('2','2021-11-30','2022-12-31','B'),
('3','2021-01-01','2021-12-31','A'),
('3','2021-11-30','2022-12-31','B'),
('3','2022-11-30','2023-12-31','C'),
('4','2021-01-01','2021-12-31','A'),
('4','2022-01-01','2022-12-31','B'),
('4','2022-11-30','2023-12-31','C');
I would like to aggregate rows by ID only when dates range (StartDate, EndDate) overlap. Here is the expected result :
1, 2021-01-01, 2021-12-31, A
1, 2022-01-01, 2022-12-31, B
2, 2021-01-01, 2022-12-31, B
3, 2021-01-01, 2023-12-31, C
4, 2021-01-01, 2021-12-31, A
4, 2022-01-01, 2023-12-31, C
When rows with overlapping dates range are aggregated, we need to keep the oldest StartDate, the newest EndDate and the Skill associated to the newest EndDate.
I tried so many queries with partition by, lag, cte, etc.
Could you help me find the right solution please ?
Thanks, Regards
This is a gaps and islands problem, to solve it you can use lag()
to determine where the "islands" start, Then use a cumulative sum()
to determine gaps :
Assuming the endDate
is unique per id
:
select d.*, s.skill
from (
select d.id, min(d.start_date) as start_date, max(d.end_date) as end_date
from (
select d.*,
sum(case when DATEDIFF(prev_end_date, start_date) > 0 then 0 else 1 end) over (partition by id order by start_date) as grp
from (
select d.*,
lag(end_date) over (partition by id order by start_date) as prev_end_date
from Skills d
) d
) d
group by d.id, grp
) d
inner join Skills s on s.id = d.id and s.end_date = d.end_date