I have an events table that I'm using as a log for a distributed computing system.
It looks like so:
CREATE TABLE events(
id SERIAL PRIMARY KEY,
occurred_at timestamp with time zone DEFAULT now()
);
There's some additional, irrelevant info within the table that I have excluded for simplicity.
The system usually operates with periods of much activity followed by very little activity, and I'd like to identify these periods retroactively.
Ideally there would be some function identify_activity_periods(tolerance text)
that would output these periods. tolerance
here represents a valid textual description of a period of time, as used in the Postgres date diffing, eg '1h'
or '3m'
To find them manually, you would have to sort every event by time, and cluster all events that happen within tolerance
of the previous/next events in order into the same activity period. This would return a table with these unique activity periods, including the starting event and final event that bookend the period (which you could then use to expose the start and end times in a view).
This might be a fairly easy grouping, but its expression has eluded me (along with my attempts to find similar questions).
How would you do this?
If the events table had data like
id: time stamp
3: now()
2: now()-'1m'
1: now()-'1h'
then you would expect identify_activity_periods('10m')
to return two activity periods:
id: start_event, end_event
2: 2, 3
1: 1, 1
and you would expect identify_activity_periods('2h')
to return
id: start_event, end_event
1: 1, 3
ordered_events
CTEclusters
CTEcluster_groups
CTEactivity_periods
CTECREATE OR REPLACE FUNCTION identify_activity_periods(tolerance INTERVAL)
RETURNS TABLE(id INT, start_event INT, end_event INT) AS $$
BEGIN
RETURN QUERY
WITH ordered_events AS (
SELECT
e.id AS event_id,
e.occurred_at,
ROW_NUMBER() OVER (ORDER BY e.occurred_at)::INT AS rn
FROM events e
),
clusters AS (
SELECT
oe.event_id,
oe.occurred_at,
CASE
WHEN oe.occurred_at - LAG(oe.occurred_at) OVER (ORDER BY oe.occurred_at) <= tolerance THEN 0
ELSE 1
END AS cluster_start,
ROW_NUMBER() OVER (ORDER BY oe.occurred_at)::INT AS rn
FROM ordered_events oe
),
cluster_groups AS (
SELECT
c.event_id,
c.occurred_at,
SUM(c.cluster_start) OVER (ORDER BY c.occurred_at)::INT AS cluster_id
FROM clusters c
),
activity_periods AS (
SELECT
cg.cluster_id,
MIN(cg.event_id) AS start_event,
MAX(cg.event_id) AS end_event
FROM cluster_groups cg
GROUP BY cg.cluster_id
)
SELECT
ROW_NUMBER() OVER (ORDER BY ap.cluster_id)::INT AS id,
ap.start_event,
ap.end_event
FROM activity_periods ap
ORDER BY ap.start_event;
END;
$$ LANGUAGE plpgsql;
Output
SELECT * FROM identify_activity_periods('10 minutes');
id | start_event | end_event |
---|---|---|
1 | 1 | 1 |
2 | 2 | 3 |
SELECT * FROM identify_activity_periods('2 hours');
id | start_event | end_event |
---|---|---|
1 | 1 | 3 |