Search code examples
sqlpostgresqlplpgsqlcommon-table-expressionwindow-functions

Count previous rows within range


I'd like to determine, for each row, the total number of preceding records within a given time range.

A specific example:

clone=# \d test
              Table "pg_temp_2.test"
 Column |            Type             | Modifiers 
--------+-----------------------------+-----------
 id     | bigint                      | 
 date   | timestamp without time zone | 

I'd like to know for each date the count of rows within '1 hour previous' to that date.
Can I do this with a window function? Something like (pseudo-code, not working):

SELECT id, date
     , count(*) OVER (HAVING previous_rows.date >= (date - '1 hour'::interval))  -- ?
FROM test;

I can write this by joining test against itself, as below - but this won't scale with large tables.

SELECT a.id, a.date, count(b.*)-1 
FROM test a, test b 
WHERE (b.date >= a.date - '1 hour'::interval AND b.date < a.date)
GROUP BY 1,2
ORDER BY 2;

Is this something I could do with a recursive query? Or with a regular Common Table Expression (CTE)?


Solution

  • Test case

    Using ts instead of the reserved word date as column name. Also "date" is misleading for a timestamp.

    CREATE TABLE test (
      id  bigint
    , ts  timestamp
    );
    

    Postgres 11 or newer

    Quoting the release notes for Postgres 11:

    • Add all window function framing options specified by SQL:2011 (Oliver Ford, Tom Lane)

    Specifically, allow RANGE mode to use PRECEDING and FOLLOWING to select rows having grouping values within plus or minus the specified offset. [...]

    So this is simple now. And faster than older workarounds:

    SELECT id, ts
         , count(*) OVER (ORDER BY ts RANGE '1 hour' PRECEDING EXCLUDE CURRENT ROW)
    FROM   test
    ORDER  BY ts;
    

    fiddle

    If there can be duplicates in ts, you'll have to define how to count those, and possibly do more.

    Postgres 10 or older

    You cannot do this cheaply with a plain query, CTEs and window functions - their frame definition is static, but you need a dynamic frame (depending on column values).

    Generally, you'll have to define lower and upper bound of your window carefully: The following queries exclude the current row and include the lower border.
    There is still a minor difference: the function includes previous peers of the current row, while the correlated subquery excludes them ...

    ROM - Roman's query

    Use CTEs, aggregate timestamps into an array, unnest, count ...
    While correct, performance deteriorates drastically with more than a hand full of rows. There are a couple of performance killers here. See below.

    ARR - count array elements

    I took Roman's query and tried to streamline it a bit:

    • Remove 2nd CTE which is not necessary.
    • Transform 1st CTE into subquery, which is faster.
    • Direct count() instead of re-aggregating into an array and counting with array_length().

    But array handling is expensive, and performance still deteriorates badly with more rows.

    SELECT id, ts
         , (SELECT count(*)::int - 1
            FROM   unnest(dates) x
            WHERE  x >= sub.ts - interval '1h') AS ct
    FROM (
       SELECT id, ts
            , array_agg(ts) OVER(ORDER BY ts) AS dates
       FROM   test
       ) sub;
    

    COR - correlated subquery

    You could solve it with a simple correlated subquery. A lot faster, but still ...

    SELECT id, ts
         , (SELECT count(*)
            FROM   test t1
            WHERE  t1.ts >= t.ts - interval '1h'
            AND    t1.ts < t.ts) AS ct
    FROM   test t
    ORDER  BY ts;
    

    FNC - Function

    Loop over rows in chronological order with a row_number() in plpgsql function and combine that with a cursor over the same query, spanning the desired time frame. Then we can just subtract row numbers:

    CREATE OR REPLACE FUNCTION running_window_ct(_intv interval = '1 hour')
      RETURNS TABLE (id bigint, ts timestamp, ct int)
      LANGUAGE plpgsql AS
    $func$
    DECLARE
       cur   CURSOR FOR
             SELECT t.ts + _intv AS ts1
                  , row_number() OVER (ORDER BY t.ts ROWS UNBOUNDED PRECEDING) AS rn
             FROM   test t
             ORDER  BY t.ts;
       rec   record;
       rn    int;
    BEGIN
       OPEN cur;
       FETCH cur INTO rec;
       ct := -1;  -- init
    
       FOR id, ts, rn IN
          SELECT t.id, t.ts
               , row_number() OVER (ORDER BY t.ts ROWS UNBOUNDED PRECEDING)
          FROM   test t ORDER BY t.ts
       LOOP
          IF rec.ts1 >= ts THEN
             ct := ct + 1;
          ELSE
             LOOP
                FETCH cur INTO rec;
                EXIT WHEN rec.ts1 >= ts;
             END LOOP;
             ct := rn - rec.rn;
          END IF;
    
          RETURN NEXT;
       END LOOP;
    END
    $func$;
    

    Why ROWS UNBOUNDED PRECEDING? See:

    Call with default interval of one hour:

    SELECT * FROM running_window_ct();
    

    Or with any interval:

    SELECT * FROM running_window_ct('2 hour - 3 second');
    

    fiddle
    Old sqlfiddle

    Benchmark

    With the table from above I ran a quick benchmark on my old test server: (PostgreSQL 9.1.9 on Debian).

    -- TRUNCATE test;
    INSERT INTO test
    SELECT g, '2013-08-08'::timestamp
             + g * interval '5 min'
             + random() * 300 * interval '1 min' -- halfway realistic values
    FROM   generate_series(1, 10000) g;
    
    CREATE INDEX test_ts_idx ON test (ts);
    ANALYZE test;  -- temp table needs manual analyze
    

    I varied the bold part for each run and took the best of 5 with EXPLAIN ANALYZE.

    100 rows
    ROM: 27.656 ms
    ARR: 7.834 ms
    COR: 5.488 ms
    FNC: 1.115 ms

    1000 rows
    ROM: 2116.029 ms
    ARR: 189.679 ms
    COR: 65.802 ms
    FNC: 8.466 ms

    5000 rows
    ROM: 51347 ms !!
    ARR: 3167 ms
    COR: 333 ms
    FNC: 42 ms

    100000 rows
    ROM: DNF
    ARR: DNF
    COR: 6760 ms
    FNC: 828 ms

    The function is the clear victor. It is fastest by an order of magnitude and scales best.
    Array handling cannot compete.