Search code examples
sqlpostgresqlpostgresql-11sql-limit

Equivalent for FETCH FIRST WITH TIES in Postgres 11 with comparable performance


Given the following DDL

CREATE TABLE numbers (
    id     BIGSERIAL PRIMARY KEY,
    number BIGINT
);
CREATE INDEX ON numbers (number);

In PostgreSQL 13 I can use the following query:

SELECT *
FROM numbers
WHERE number > 1
ORDER BY number
FETCH FIRST 1000 ROWS WITH TIES

It produces a very effective query plan and performs well enough with large tables:

Limit  (cost=...)                                                   
  ->  Index Scan using numbers_number_idx on numbers  (cost=...) 
        Index Cond: (number > 1)  

Is there an equivalent for that in PostgreSQL 11 that gives a similar query plan and comparable performance for large (10TB+) tables?

This answer suggests to use the following query:

WITH cte AS (
    SELECT *, RANK() OVER (ORDER BY number) AS rnk
    FROM numbers
    WHERE number > 1
)
SELECT *
FROM cte
WHERE rnk <= 1000;

But it is not really usable for large tables because performance is many times worse.

Subquery Scan on cte  (cost=...)                                       
  Filter: (cte.rnk <= 1000)                                                                       
  ->  WindowAgg  (cost=...)                                            
        ->  Index Scan using numbers_number_idx on numbers  (cost=...) 
              Index Cond: (number > 1)

Solution

  • You won't get the performance of the extremely convenient WITH TIES in Postgres 13 or later. See:

    But there are always options ...

    Faster SQL variant

    Should be faster for big tables because it avoids to rank the whole table like the simple solution in the referenced answer.

    WITH cte AS (
       SELECT *, row_number() OVER (ORDER  BY number, id) AS rn  -- ORDER must match
       FROM   numbers
       WHERE  number > 1
       ORDER  BY number, id  -- tiebreaker to make it deterministic
       LIMIT  1000
       )
    TABLE cte
    UNION ALL
    (
    SELECT n.*, 1000 + row_number() OVER (ORDER BY n.id) AS rn
    FROM   numbers n, (SELECT number, id FROM cte WHERE rn = 1000) AS max
    WHERE  n.number = max.number
    AND    n.id > max.id
    ORDER  BY n.id
    );
    

    id is any UNIQUE NOT NULL column in your table that lends itself as tiebreaker, typically the PRIMARY KEY.

    Sort by that additionally, which has the (maybe welcome) side effect that you get a deterministically ordered result.

    Ideally, you have a multicolumn index on (number, id) for this. But it will use the existing index on just (number), too. Resulting performance depends on cardinalities and data types.
    Related:

    Since the CTE query counts as one command, there is no race condition under concurrent write load. All parts of the command see the same snapshot of the table in default isolation level READ COMMITTED.

    I might wrap this into a simple SQL function (or prepared statement) and pass LIMIT to it for convenience.

    Alternative with PL/pgSQL

    Only running a single SELECT probably outweighs the added overhead. Works optimally with the existing index on just (number):

    CREATE OR REPLACE FUNCTION foo(_bound int, _limit int, OUT _row public.numbers)
      RETURNS SETOF public.numbers
      LANGUAGE plpgsql AS
    $func$
    DECLARE
       _ct int := 0;
       _number int;  -- match type!
    BEGIN
       FOR _row IN
          SELECT *
          FROM   public.numbers
          WHERE  number > _bound
          ORDER  BY number
       LOOP
          _ct := _ct + 1;
          CASE 
          WHEN _ct < _limit THEN
             -- do nothing
          WHEN _ct > _limit THEN
             EXIT WHEN _row.number > _number;
          WHEN _ct = _limit THEN
             _number := _row.number;   
          END CASE;
    
          RETURN NEXT;
       END LOOP;
    END
    $func$;
    

    But it's tailored for just the one query. Gets tedious for varying queries.