Search code examples
sqlpostgresqlintervalsgaps-and-islands

Get combined intervals for overlapping sessions


I want to calculate session intervals for multiple, possibly overlapping sessions per agent. I have a log table like this in a PostgreSQL database:

id  dt                  session_id  agent_id state
7   2024-01-25 22:26:57 4           3148    0
7   2024-01-25 22:24:57 4           3148    1
6   2024-01-25 22:23:57 2           3148    0
5   2024-01-25 15:53:30 1           3148    0
4   2024-01-25 15:53:30 3           3148    0
3   2024-01-25 13:53:02 3           3148    1
2   2024-01-25 12:43:10 2           3148    1
1   2024-01-25 12:30:02 1           3148    1

state = 1 indicates the start of a session, state = 0 its end.
For single sessions I use the lag() function:

select agent_id, status, dt, lag(dt) over (partition by (agent_id) order by dt) as pre_status_dt  from 
(
    select *, 
      lag(status) over (partition by (agent_id) order by dt) as pre_status 
    from (
        select cs.dt, cs.user_id as agent_id, cs.status
        from "channel_subscribe" cs 
        where cs.channel = 14
    ) as t1
    where t1.dt >= '2024-01-01'
) as t2
where t2.status != t2.pre_status
order by dt asc

But that does not work for overlapping sessions.

I would like to get this result:

agent_id  start_dt             end_dt
3148      2024-01-25 12:30:02  2024-01-25 22:23:57
3148      2024-01-25 22:24:57  2024-01-25 22:26:57

Solution

  • Your sample shows nested sessions: Additional sessions start and end within the first open session of the same agent. Alternatively, sessions might be chained: next session starts within the first session, but ends later.

    The following solutions cover both cases: A combined session starts with the first open session, and ends once all open sessions have been closed.

    Each assumes consistent data: Every session that gets closed has been opened before (or at the same time, but with am earlier id). Unfinished sessions are reported as such (with null for end_dt). If your data is not as reliable, you need to define what can go wrong, and handle cases accordingly.

    Pure SQL

    SELECT agent_id
         , min(dt) AS start_dt
         , max(dt) FILTER (WHERE sum_state = 0) AS end_dt  -- !
    FROM  (
       SELECT *
            , count(*) FILTER (WHERE sum_state = 0)
              OVER (PARTITION BY agent_id ORDER BY dt DESC, id DESC) AS island
       FROM  (
          SELECT *
               , sum(CASE state WHEN 0 THEN -1 ELSE 1 END)
                 OVER (PARTITION BY agent_id ORDER BY dt, id) sum_state
          FROM   channel_subscribe
          ) sub1
       ) sub2
    GROUP  BY agent_id, island
    ORDER  BY agent_id, island DESC;
    

    The query in sub1 adds a running count per agent as sum_state. A combined session ends when sum_state goes back to 0.

    Subquery sub2 forms groups (island) by counting "0"-events per agent. Note the descending sort (DESC) to include each "0"-event in its group.

    The outer query reports data as requested for every combined session. The added FILTER (WHERE sum_state = 0) makes sure that unfinished sessions are reported as such.

    Procedural solution

    Once we need multiple subquery levels (with changing sort order), it's probably faster to use a (combined) procedural solution like this PL/pgSQL function:

    CREATE OR REPLACE FUNCTION f_combined_sessions()
      RETURNS TABLE (agent_id int, start_dt timestamp, end_dt timestamp)
      LANGUAGE plpgsql AS
    $func$
    DECLARE
       r            record;
       _new_session bool;
       _agent_id    int;
    BEGIN
       FOR r IN
          SELECT c.agent_id, c.dt
               , sum(CASE c.state WHEN 0 THEN -1 ELSE 1 END)
                 OVER (PARTITION BY c.agent_id ORDER BY c.dt, c.id) AS sum_state
          FROM   channel_subscribe c
          ORDER  BY c.agent_id,  c.dt, c.id
       LOOP
          IF agent_id = r.agent_id THEN  -- same agent
             IF _new_session = true THEN
                start_dt     := r.dt;
                _new_session := false;
             END IF;
          ELSE                           -- new agent
             -- return open session?
             IF _new_session = false THEN
                end_dt := null;
                RETURN NEXT;
             END IF;
             agent_id     := r.agent_id;
             start_dt     := r.dt;
             _new_session := false;
          END IF;
          
          IF r.sum_state = 0 THEN
             end_dt := r.dt;
             RETURN NEXT;
             _new_session := true;
          END IF;
       END LOOP;
    
       -- return last open session?
       IF _new_session = false THEN
          end_dt := null;
          RETURN NEXT;
       END IF;
    END
    $func$;
    

    Call:

    SELECT * FROM f_combined_sessions();
    

    fiddle


    Related: