Search code examples
sqlpostgresqlloggingevents

How to split events into periods of activity in postgres?


I have an events table that I'm using as a log for a distributed computing system.

It looks like so:

CREATE TABLE events(
  id SERIAL PRIMARY KEY,
  occurred_at timestamp with time zone DEFAULT now()
);

There's some additional, irrelevant info within the table that I have excluded for simplicity.

The system usually operates with periods of much activity followed by very little activity, and I'd like to identify these periods retroactively.

Ideally there would be some function identify_activity_periods(tolerance text) that would output these periods. tolerance here represents a valid textual description of a period of time, as used in the Postgres date diffing, eg '1h' or '3m'

To find them manually, you would have to sort every event by time, and cluster all events that happen within tolerance of the previous/next events in order into the same activity period. This would return a table with these unique activity periods, including the starting event and final event that bookend the period (which you could then use to expose the start and end times in a view).

This might be a fairly easy grouping, but its expression has eluded me (along with my attempts to find similar questions).

How would you do this?

If the events table had data like

id: time stamp
3: now()
2: now()-'1m'
1: now()-'1h'

then you would expect identify_activity_periods('10m') to return two activity periods:

id: start_event, end_event
2: 2, 3
1: 1, 1

and you would expect identify_activity_periods('2h') to return

id: start_event, end_event
1: 1, 3

Solution

    1. You can start by ordering your events and ranking each of them in ordered_events CTE
    2. If the time difference between the current event and the previous event is within tolerance, it is part of the same cluster (cluster_start is 0). Otherwise, it starts a new cluster (cluster_start is 1) in clusters CTE
    3. A unique id is assigned to each cluster of events in cluster_groups CTE
    4. start and end events for each cluster is identified in activity_periods CTE
    5. Finally unique id is assigned to each activity period.

    Fiddle

    CREATE OR REPLACE FUNCTION identify_activity_periods(tolerance INTERVAL)
    RETURNS TABLE(id INT, start_event INT, end_event INT) AS $$
    BEGIN
      RETURN QUERY
      WITH ordered_events AS (
        SELECT 
          e.id AS event_id, 
          e.occurred_at, 
          ROW_NUMBER() OVER (ORDER BY e.occurred_at)::INT AS rn
        FROM events e
      ),
      clusters AS (
        SELECT 
          oe.event_id, 
          oe.occurred_at, 
          CASE 
            WHEN oe.occurred_at - LAG(oe.occurred_at) OVER (ORDER BY oe.occurred_at) <= tolerance THEN 0 
            ELSE 1 
          END AS cluster_start,
          ROW_NUMBER() OVER (ORDER BY oe.occurred_at)::INT AS rn
        FROM ordered_events oe
      ),
      cluster_groups AS (
        SELECT 
          c.event_id, 
          c.occurred_at, 
          SUM(c.cluster_start) OVER (ORDER BY c.occurred_at)::INT AS cluster_id
        FROM clusters c
      ),
      activity_periods AS (
        SELECT 
          cg.cluster_id,
          MIN(cg.event_id) AS start_event,
          MAX(cg.event_id) AS end_event
        FROM cluster_groups cg
        GROUP BY cg.cluster_id
      )
      SELECT 
        ROW_NUMBER() OVER (ORDER BY ap.cluster_id)::INT AS id,
        ap.start_event, 
        ap.end_event
      FROM activity_periods ap
      ORDER BY ap.start_event;
    END;
    $$ LANGUAGE plpgsql;
    

    Output

    SELECT * FROM identify_activity_periods('10 minutes');
    
    id start_event end_event
    1 1 1
    2 2 3
    SELECT * FROM identify_activity_periods('2 hours');
    
    id start_event end_event
    1 1 3