Tags: sql, postgresql, plpgsql, common-table-expression, window-functions

Count previous rows within range


I'd like to determine, for each row, the total number of preceding records within a given time range.

A specific example:

clone=# \d test
              Table "pg_temp_2.test"
 Column |            Type             | Modifiers 
--------+-----------------------------+-----------
 id     | bigint                      | 
 date   | timestamp without time zone | 

I'd like to know for each date the count of rows within '1 hour previous' to that date.
Can I do this with a window function? Something like (pseudo-code, not working):

SELECT id, date
     , count(*) OVER (HAVING previous_rows.date >= (date - '1 hour'::interval))  -- ?
FROM test;

I can write this by joining test against itself, as below - but this won't scale with large tables.

SELECT a.id, a.date, count(b.*) - 1   -- -1 to exclude the row matched with itself
FROM test a, test b 
WHERE (b.date >= a.date - '1 hour'::interval AND b.date <= a.date)
GROUP BY 1,2
ORDER BY 2;

Is this something I could do with a recursive query? Or with a regular Common Table Expression (CTE)?


Solution

  • Test case

    Using ts instead of date as the column name: date is reserved in standard SQL (though Postgres allows it as an identifier), and "date" is misleading for a timestamp anyway.

    CREATE TABLE test (
      id  bigint
    , ts  timestamp
    );
    

    Postgres 11 or newer

    Quoting the release notes for Postgres 11:

    • Add all window function framing options specified by SQL:2011 (Oliver Ford, Tom Lane)

    Specifically, allow RANGE mode to use PRECEDING and FOLLOWING to select rows having grouping values within plus or minus the specified offset. [...]

    So this is simple now. And faster than older workarounds:

    SELECT id, ts
         , count(*) OVER (ORDER BY ts RANGE '1 hour' PRECEDING EXCLUDE CURRENT ROW)
    FROM   test
    ORDER  BY ts;
    


    If there can be duplicates in ts, you'll have to define how to count those, and possibly do more.
    Syntax details in the manual.
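
    For example, if rows sharing the same ts should not count each other, one option (just a sketch, assuming that is the semantics you want) is to exclude the whole peer group instead of only the current row:

    SELECT id, ts
         , count(*) OVER (ORDER BY ts RANGE '1 hour' PRECEDING EXCLUDE GROUP) AS ct
    FROM   test
    ORDER  BY ts;

    EXCLUDE GROUP drops the current row and all its peers from the frame; EXCLUDE TIES would keep the current row but drop its peers.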

    Postgres 10 or older

    You cannot do this cheaply with a plain query, CTEs and window functions - their frame definition is static, but you need a dynamic frame (depending on column values).

    Generally, you'll have to define the lower and upper bounds of your window carefully. The following queries exclude the current row and include the lower border.
    There is still a minor difference: the function counts previous peers of the current row (rows with the same ts), while the correlated subquery excludes them, as the illustration below shows.
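
    A small illustration of that difference, with made-up sample data (two rows share a timestamp):

    -- hypothetical sample data: id 1 and id 2 share the same timestamp
    INSERT INTO test VALUES
      (1, '2013-08-08 12:00')
    , (2, '2013-08-08 12:00')
    , (3, '2013-08-08 12:30');

    -- The correlated subquery (COR below) uses t1.ts < t.ts, so the two peers do not
    -- count each other: both get ct = 0.
    -- The function (FNC below) counts previous peers: whichever of id 1 / id 2 is
    -- processed second gets ct = 1 (the tie is broken arbitrarily by the sort).
    -- Row id 3 gets ct = 2 either way.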

    ROM - Roman's query

    Use CTEs, aggregate timestamps into an array, unnest, count ...
    While correct, performance deteriorates drastically with more than a handful of rows. There are a couple of performance killers here; see the streamlining notes below.
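
    Roman's original SQL is not reproduced here; roughly, the approach looks like this (a reconstruction of its general shape, not his exact query):

    WITH cte1 AS (     -- all timestamps up to and including the current row
       SELECT id, ts
            , array_agg(ts) OVER (ORDER BY ts) AS dates
       FROM   test
    ), cte2 AS (       -- keep only the timestamps within the last hour
       SELECT id, ts
            , ARRAY(SELECT x
                    FROM   unnest(dates) x
                    WHERE  x >= cte1.ts - interval '1 hour') AS dates
       FROM   cte1
    )
    SELECT id, ts
         , array_length(dates, 1) - 1 AS ct   -- -1 for the current row itself
    FROM   cte2
    ORDER  BY ts;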

    ARR - count array elements

    I took Roman's query and tried to streamline it a bit:

    • Remove the 2nd CTE, which is not necessary.
    • Transform the 1st CTE into a subquery, which is faster.
    • Use count() directly instead of re-aggregating into an array and counting with array_length().

    But array handling is expensive, and performance still deteriorates badly with more rows.

    SELECT id, ts
         , (SELECT count(*)::int - 1
            FROM   unnest(dates) x
            WHERE  x >= sub.ts - interval '1h') AS ct
    FROM (
       SELECT id, ts
            , array_agg(ts) OVER(ORDER BY ts) AS dates
       FROM   test
       ) sub;
    

    COR - correlated subquery

    You could solve it with a simple correlated subquery. A lot faster, but still ...

    SELECT id, ts
         , (SELECT count(*)
            FROM   test t1
            WHERE  t1.ts >= t.ts - interval '1h'
            AND    t1.ts < t.ts) AS ct
    FROM   test t
    ORDER  BY ts;
    

    FNC - Function

    In a plpgsql function, loop over rows in chronological order, attaching a row_number(), and combine that with a cursor over the same query shifted forward by the desired time frame. Then we can just subtract row numbers:

    CREATE OR REPLACE FUNCTION running_window_ct(_intv interval = '1 hour')
      RETURNS TABLE (id bigint, ts timestamp, ct int)
      LANGUAGE plpgsql AS
    $func$
    DECLARE
       cur   CURSOR FOR                   -- same rows, but each ts shifted forward by _intv
             SELECT t.ts + _intv AS ts1   -- a row stays inside a later row's frame while ts1 >= that row's ts
                  , row_number() OVER (ORDER BY t.ts ROWS UNBOUNDED PRECEDING) AS rn
             FROM   test t
             ORDER  BY t.ts;
       rec   record;
       rn    int;
    BEGIN
       OPEN cur;
       FETCH cur INTO rec;   -- rec marks the candidate start of the current frame
       ct := -1;             -- init; becomes 0 for the first row
    
       FOR id, ts, rn IN
          SELECT t.id, t.ts
               , row_number() OVER (ORDER BY t.ts ROWS UNBOUNDED PRECEDING)
          FROM   test t ORDER BY t.ts
       LOOP
          IF rec.ts1 >= ts THEN      -- frame start unchanged: one more row in the frame
             ct := ct + 1;
          ELSE                       -- frame start moves: advance cursor to the first row still inside
             LOOP
                FETCH cur INTO rec;
                EXIT WHEN rec.ts1 >= ts;
             END LOOP;
             ct := rn - rec.rn;      -- rows between frame start and current row
          END IF;
    
          RETURN NEXT;
       END LOOP;
    END
    $func$;
    

    Why ROWS UNBOUNDED PRECEDING? A frame in ROWS mode is cheaper to compute than the default RANGE mode frame.

    Call with default interval of one hour:

    SELECT * FROM running_window_ct();
    

    Or with any interval:

    SELECT * FROM running_window_ct('2 hour - 3 second');
    


    Benchmark

    With the table from above, I ran a quick benchmark on my old test server (PostgreSQL 9.1.9 on Debian).

    -- TRUNCATE test;
    INSERT INTO test
    SELECT g, '2013-08-08'::timestamp
             + g * interval '5 min'
             + random() * 300 * interval '1 min' -- halfway realistic values
    FROM   generate_series(1, 10000) g;
    
    CREATE INDEX test_ts_idx ON test (ts);
    ANALYZE test;  -- temp table needs manual analyze
    

    I varied the number of generated rows (the second argument to generate_series()) for each run and took the best of 5 with EXPLAIN ANALYZE.
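
    For instance, a run for the correlated subquery (COR) would have looked roughly like this (a reconstruction, not the exact statement used):

    EXPLAIN ANALYZE
    SELECT id, ts
         , (SELECT count(*)
            FROM   test t1
            WHERE  t1.ts >= t.ts - interval '1h'
            AND    t1.ts <  t.ts) AS ct
    FROM   test t
    ORDER  BY ts;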

    rows      ROM           ARR          COR         FNC
    100       27.656 ms     7.834 ms     5.488 ms    1.115 ms
    1000      2116.029 ms   189.679 ms   65.802 ms   8.466 ms
    5000      51347 ms !!   3167 ms      333 ms      42 ms
    100000    DNF           DNF          6760 ms     828 ms

    (DNF: did not finish.)

    The function is the clear victor. It is fastest by an order of magnitude and scales best.
    Array handling cannot compete.