How to split events into periods of activity in postgres?

I have an events table that I'm using as a log for a distributed computing system.

It looks like so:

CREATE TABLE events(
  id SERIAL PRIMARY KEY,
  occurred_at timestamp with time zone DEFAULT now()
);

There's some additional, irrelevant info within the table that I have excluded for simplicity.

The system usually operates with periods of much activity followed by very little activity, and I'd like to identify these periods retroactively.

Ideally there would be some function identify_activity_periods(tolerance text) that would output these periods. tolerance here represents a valid textual description of a period of time, as used in the Postgres date diffing, eg '1h' or '3m'

To find them manually, you would have to sort every event by time, and cluster all events that happen within tolerance of the previous/next events in order into the same activity period. This would return a table with these unique activity periods, including the starting event and final event that bookend the period (which you could then use to expose the start and end times in a view).

This might be a fairly easy grouping, but its expression has eluded me (along with my attempts to find similar questions).

How would you do this?

If the events table had data like

id: time stamp
3: now()
2: now()-'1m'
1: now()-'1h'

then you would expect identify_activity_periods('10m') to return two activity periods:

id: start_event, end_event
2: 2, 3
1: 1, 1

and you would expect identify_activity_periods('2h') to return

id: start_event, end_event
1: 1, 3

Solution

You can start by ordering your events and ranking each of them in ordered_events CTE
If the time difference between the current event and the previous event is within tolerance, it is part of the same cluster (cluster_start is 0). Otherwise, it starts a new cluster (cluster_start is 1) in clusters CTE
A unique id is assigned to each cluster of events in cluster_groups CTE
start and end events for each cluster is identified in activity_periods CTE
Finally unique id is assigned to each activity period.

Fiddle

CREATE OR REPLACE FUNCTION identify_activity_periods(tolerance INTERVAL)
RETURNS TABLE(id INT, start_event INT, end_event INT) AS $$
BEGIN
  RETURN QUERY
  WITH ordered_events AS (
    SELECT 
      e.id AS event_id, 
      e.occurred_at, 
      ROW_NUMBER() OVER (ORDER BY e.occurred_at)::INT AS rn
    FROM events e
  ),
  clusters AS (
    SELECT 
      oe.event_id, 
      oe.occurred_at, 
      CASE 
        WHEN oe.occurred_at - LAG(oe.occurred_at) OVER (ORDER BY oe.occurred_at) <= tolerance THEN 0 
        ELSE 1 
      END AS cluster_start,
      ROW_NUMBER() OVER (ORDER BY oe.occurred_at)::INT AS rn
    FROM ordered_events oe
  ),
  cluster_groups AS (
    SELECT 
      c.event_id, 
      c.occurred_at, 
      SUM(c.cluster_start) OVER (ORDER BY c.occurred_at)::INT AS cluster_id
    FROM clusters c
  ),
  activity_periods AS (
    SELECT 
      cg.cluster_id,
      MIN(cg.event_id) AS start_event,
      MAX(cg.event_id) AS end_event
    FROM cluster_groups cg
    GROUP BY cg.cluster_id
  )
  SELECT 
    ROW_NUMBER() OVER (ORDER BY ap.cluster_id)::INT AS id,
    ap.start_event, 
    ap.end_event
  FROM activity_periods ap
  ORDER BY ap.start_event;
END;
$$ LANGUAGE plpgsql;

Output

SELECT * FROM identify_activity_periods('10 minutes');

id	start_event	end_event
1	1	1
2	2	3

SELECT * FROM identify_activity_periods('2 hours');

id	start_event	end_event
1	1	3