I am trying to write a query to generate fixed non-overlapping windows and counting the number of records within each window. The next window starts on the next date of record after the previous window's end date. There are gaps between dates. I think this a form of gaps and islands problem. I am using PostgresSQL.
Here are the constraints
The issue I am having is how to determine the next window start given the time gap after the last window ends.
Here is example data:
id|ts |
--+-----------------------+
1|2024-12-20 21:48:24.877|
1|2025-01-04 03:17:32.757|
1|2025-01-17 20:14:57.942|
2|2025-01-02 22:57:29.979|
2|2025-01-15 16:16:17.941|
2|2025-01-16 16:25:20.665|
2|2025-01-29 16:17:04.410|
2|2025-01-30 16:26:21.598|
3|2024-12-19 20:33:39.793|
3|2024-12-28 06:44:24.236|
3|2024-12-31 05:13:19.438|
3|2025-01-03 10:14:29.228|
3|2025-01-09 18:11:22.303|
3|2025-01-10 18:32:00.508|
3|2025-01-12 20:21:10.596|
3|2025-01-16 17:40:39.347|
The expected output:
id|window_start |window_end |count|
--+-----------------------+-----------------------+-----|
1|2024-12-20 21:48:24.877|2024-12-27 21:48:24.877| 1|
1|2025-01-04 03:17:32.757|2025-01-11 03:17:32.757| 1|
1|2025-01-17 20:14:57.942|2025-01-24 20:14:57.942| 1|
2|2025-01-02 22:57:29.979|2025-01-09 22:57:29.979| 1|
2|2025-01-15 22:57:29.979|2025-01-22 22:57:29.979| 2|
2|2025-01-29 22:57:29.979|2025-02-05 22:57:29.979| 2|
3|2024-12-19 20:33:39.793|2024-12-26 20:33:39.793| 1|
3|2024-12-28 06:44:24.236|2025-01-04 06:44:24.236| 3|
3|2025-01-09 18:11:22.303|2025-01-16 18:11:22.303| 4|
Decomposing this reasoning with CTEs gives the SQL below.
Note that to avoid computing the "next" inside the recursive CTE,
I prefered precomputing the "next" for each entry "in case it later became an island start".
with recursive
-- Identify unambiguous window starts: those with no previous ts in the 7 previous days.
maybe as
(
select
*,
row_number() over w num, -- Will ease our reading of results.
-- startpoint:
-- - true: confirmed start of a new window
-- - null: maybe, maybe not; will be later decided (depending on if the previous ts (nearer than 7 days ago), has itself been eaten by a previous window (thus let us be a new start) or not (then the previous is a start and we're eaten by it)).
case when lag(ts) over w >= ts - interval '7d' then null else true end startpoint
from ts
window w as (partition by id order by ts)
),
-- Continents of ts never more than 7 days far one from another.
c as
(
select
*,
-- Identify it by the num of the unambiguous starting point.
max(case when startpoint then num end) over w continent,
-- Now attributes for *hypothetical* new island starts:
-- for each ts, choose its successor in case this one becomes an island start
-- The successor is the first row from the same continent, but further than 7 days from this one.
min(num) over (w range between '7d' following and unbounded following) successor,
-- Number of rows which would belong to this 7 days window (in case the current row is a window start).
count(1) over (w range between current row and '7d' following) n_included
from maybe
window w as (partition by id order by ts)
),
-- Now iterate starting from the continents,
-- to see if we can determine islands within them.
-- (each start of island "eats" the 7 following days, so the first row after 7 days can be elected as the start of a new island)
i as
(
select * from c where startpoint
union
select next.*
-- Do not forget to filter on island, as successor has been computed without this criteria (before we had determined islands).
from i join c next using (id, continent)
where next.num = i.successor
)
select * from i order by id, num;