Search code examples
sqlpostgresqlwindow-functionsgaps-and-islands

PostgreSQL cluster uninterrupted series of data


here's an example of the input and the desired output, I think this is easier to grasp than me describing the need.

  • input data (business logic explanation: everything with a duration > 1M is set to 0M)
name start end difference ongoing_sequence
test 1 2024-04-21 00:00:21 2024-04-21 00:01:21 00:01:00 1
test 1 2024-04-21 00:01:21 2024-04-21 00:02:21 00:01:00 1
test 1 2024-04-21 00:02:21 2024-04-21 00:03:21 00:01:00 1
test 1 2024-04-21 00:03:21 2024-04-21 00:15:21 00:00:00 0
test 1 2024-04-21 00:15:21 2024-04-21 00:16:21 00:01:00 1
test 1 2024-04-21 00:16:21 2024-04-21 00:17:21 00:01:00 1
test 1 2024-04-21 00:17:21 2024-04-21 00:23:21 00:00:00 0
test 2 2024-04-22 00:00:21 2024-04-22 00:01:21 00:01:00 1
test 2 2024-04-22 00:01:21 2024-04-22 00:02:21 00:01:00 1
  • expected result; it should result in one record per uninterrupted ongoing_sequence with the sum of the difference.
name min_start max_end difference
test 1 2024-04-21 00:00:21 2024-04-21 00:03:21 00:03:00
test 1 2024-04-21 00:15:21 2024-04-21 00:17:21 00:02:00
test 2 2024-04-22 00:00:21 2024-04-22 00:02:21 00:02:00

Could you please give me some advice on how to achieve it in a performant way directly in SQL? Iterating the results and creating the output in that way is not an option.

Thanks in advance!

Similar posts exist, like group by time period but this does not lead to the expected results.


Solution

  • You can use a rolling (stepping, tumbling) count()over window1. Demo at db<>fiddle:

    select name
          ,min("start")    as max_start
          ,max("end")      as max_end
          ,sum(difference) as difference
    from(select *,count(*)filter(where ongoing_sequence=0)over(w1) as seq_no
         from test
         window w1 as (partition by name order by "start"))_
    where ongoing_sequence<>0
    group by name,seq_no
    order by name;
    
    name max_start max_end difference
    test 1 2024-04-21 00:00:21 2024-04-21 00:03:21 00:03:00
    test 1 2024-04-21 00:15:21 2024-04-21 00:17:21 00:02:00
    test 2 2024-04-22 00:00:21 2024-04-22 00:02:21 00:02:00

    The order by "start" implies range between unbounded preceding and current row which means each row looks back and counts how many gaps there were so far. Each uninterrupted sequence of rows will share that count - if there was a gap, it'd get incremented.
    In the outer query you just ignore the gap rows and add that shared sequence number to the group by "name".

    The count()filter(where X) means sum(case when X then 1 else 0 end) or count(case when X then 1 end) but it's a bit nicer.

    It's called a gaps-and-islands problem.