sql postgresql window-functions gaps-and-islands

PostgreSQL cluster uninterrupted series of data

here's an example of the input and the desired output, I think this is easier to grasp than me describing the need.

input data (business logic explanation: everything with a duration > 1M is set to 0M)

name	start	end	difference	ongoing_sequence
test 1	2024-04-21 00:00:21	2024-04-21 00:01:21	00:01:00	1
test 1	2024-04-21 00:01:21	2024-04-21 00:02:21	00:01:00	1
test 1	2024-04-21 00:02:21	2024-04-21 00:03:21	00:01:00	1
test 1	2024-04-21 00:03:21	2024-04-21 00:15:21	00:00:00	0
test 1	2024-04-21 00:15:21	2024-04-21 00:16:21	00:01:00	1
test 1	2024-04-21 00:16:21	2024-04-21 00:17:21	00:01:00	1
test 1	2024-04-21 00:17:21	2024-04-21 00:23:21	00:00:00	0
test 2	2024-04-22 00:00:21	2024-04-22 00:01:21	00:01:00	1
test 2	2024-04-22 00:01:21	2024-04-22 00:02:21	00:01:00	1

expected result; it should result in one record per uninterrupted ongoing_sequence with the sum of the difference.

name	min_start	max_end	difference
test 1	2024-04-21 00:00:21	2024-04-21 00:03:21	00:03:00
test 1	2024-04-21 00:15:21	2024-04-21 00:17:21	00:02:00
test 2	2024-04-22 00:00:21	2024-04-22 00:02:21	00:02:00

Could you please give me some advice on how to achieve it in a performant way directly in SQL? Iterating the results and creating the output in that way is not an option.

Thanks in advance!

Similar posts exist, like group by time period but this does not lead to the expected results.

Solution

You can use a rolling (stepping, tumbling) count()over window1. Demo at db<>fiddle:

select name
      ,min("start")    as max_start
      ,max("end")      as max_end
      ,sum(difference) as difference
from(select *,count(*)filter(where ongoing_sequence=0)over(w1) as seq_no
     from test
     window w1 as (partition by name order by "start"))_
where ongoing_sequence<>0
group by name,seq_no
order by name;

name	max_start	max_end	difference
test 1	2024-04-21 00:00:21	2024-04-21 00:03:21	00:03:00
test 1	2024-04-21 00:15:21	2024-04-21 00:17:21	00:02:00
test 2	2024-04-22 00:00:21	2024-04-22 00:02:21	00:02:00

The order by "start" implies range between unbounded preceding and current row which means each row looks back and counts how many gaps there were so far. Each uninterrupted sequence of rows will share that count - if there was a gap, it'd get incremented.
In the outer query you just ignore the gap rows and add that shared sequence number to the group by "name".

The count()filter(where X) means sum(case when X then 1 else 0 end) or count(case when X then 1 end) but it's a bit nicer.

It's called a gaps-and-islands problem.