Search code examples
sqlpostgresqlgaps-and-islands

Postgres Fixed 7-day window shifting data


I am trying to write a query to generate fixed non-overlapping windows and counting the number of records within each window. The next window starts on the next date of record after the previous window's end date. There are gaps between dates. I think this a form of gaps and islands problem. I am using PostgresSQL.

Here are the constraints

  • The data is partitioned by id, ts
  • Fixed 7 day window
  • The windows are non-overlapping
  • The next window starts on the first day of record after the end date of the previous window

The issue I am having is how to determine the next window start given the time gap after the last window ends.

Here is example data:

id|ts                     |
--+-----------------------+
 1|2024-12-20 21:48:24.877|
 1|2025-01-04 03:17:32.757|
 1|2025-01-17 20:14:57.942|
 2|2025-01-02 22:57:29.979|
 2|2025-01-15 16:16:17.941|
 2|2025-01-16 16:25:20.665|
 2|2025-01-29 16:17:04.410|
 2|2025-01-30 16:26:21.598|
 3|2024-12-19 20:33:39.793|
 3|2024-12-28 06:44:24.236|
 3|2024-12-31 05:13:19.438|
 3|2025-01-03 10:14:29.228|
 3|2025-01-09 18:11:22.303|
 3|2025-01-10 18:32:00.508|
 3|2025-01-12 20:21:10.596|
 3|2025-01-16 17:40:39.347|

The expected output:

id|window_start           |window_end             |count|
--+-----------------------+-----------------------+-----|
 1|2024-12-20 21:48:24.877|2024-12-27 21:48:24.877|    1|
 1|2025-01-04 03:17:32.757|2025-01-11 03:17:32.757|    1|
 1|2025-01-17 20:14:57.942|2025-01-24 20:14:57.942|    1|
 2|2025-01-02 22:57:29.979|2025-01-09 22:57:29.979|    1|
 2|2025-01-15 22:57:29.979|2025-01-22 22:57:29.979|    2|
 2|2025-01-29 22:57:29.979|2025-02-05 22:57:29.979|    2|
 3|2024-12-19 20:33:39.793|2024-12-26 20:33:39.793|    1|
 3|2024-12-28 06:44:24.236|2025-01-04 06:44:24.236|    3|
 3|2025-01-09 18:11:22.303|2025-01-16 18:11:22.303|    4|

Solution

    • First you can identify "undoubtful window starts"
      as those clear of another ts in the preceding 7 days
    • Then create "continents", that is, group timestamps until a 7-day hole arises, first without trying to determine where the 7-day cuts occur.
    • Now in those continents, split in islands by iterating:
      Starting from the continent start, eat everything within 7 days (counting entries as the window count), then push the next entry as a new island start. Which will eat everything within 7 days, and so on.

    Decomposing this reasoning with CTEs gives the SQL below.
    Note that to avoid computing the "next" inside the recursive CTE,
    I prefered precomputing the "next" for each entry "in case it later became an island start".

    with recursive
      -- Identify unambiguous window starts: those with no previous ts in the 7 previous days.
      maybe as
      (
        select
          *,
          row_number() over w num, -- Will ease our reading of results.
          -- startpoint:
          -- - true: confirmed start of a new window
          -- - null: maybe, maybe not; will be later decided (depending on if the previous ts (nearer than 7 days ago), has itself been eaten by a previous window (thus let us be a new start) or not (then the previous is a start and we're eaten by it)).
          case when lag(ts) over w >= ts - interval '7d' then null else true end startpoint
        from ts
        window w as (partition by id order by ts)
      ),
      -- Continents of ts never more than 7 days far one from another.
      c as
      (
        select
          *,
          -- Identify it by the num of the unambiguous starting point.
          max(case when startpoint then num end) over w continent,
          -- Now attributes for *hypothetical* new island starts:
          -- for each ts, choose its successor in case this one becomes an island start
          -- The successor is the first row from the same continent, but further than 7 days from this one.
          min(num) over (w range between '7d' following and unbounded following) successor,
          -- Number of rows which would belong to this 7 days window (in case the current row is a window start).
          count(1) over (w range between current row and '7d' following) n_included
        from maybe
        window w as (partition by id order by ts)
      ),
      -- Now iterate starting from the continents,
      -- to see if we can determine islands within them.
      -- (each start of island "eats" the 7 following days, so the first row after 7 days can be elected as the start of a new island)
      i as
      (
        select * from c where startpoint
        union
        select next.*
        -- Do not forget to filter on island, as successor has been computed without this criteria (before we had determined islands).
        from i join c next using (id, continent)
        where next.num = i.successor
      )
    select * from i order by id, num;
    

    Demo in a Fiddle