postgresql sql-execution-plan database-indexes

Filter ruins performance of index scan when paginating while row-comparing multi-column index to a row fetched in a subquery

I want to implement efficient pagination in Postgresql (14.5): give me a few rows associated with some a, that come immediately after some b. Unfortunately Index Cond disappears from under the Index Only Scan as soon as I try to fetch the a and b values from the database itself. Is there a way to avoid it?

The last query in this long list is doing what I want, but slowly. Everything else is presented for completeness to showcase different aspects of the problem.

Creating table and data

CREATE TABLE foo (id SERIAL PRIMARY KEY, a INT NOT NULL, b INT NOT NULL);
CREATE UNIQUE INDEX foo_a_b ON foo (a, b);

INSERT INTO foo (a, b) 
SELECT *
FROM generate_series(0, 9) a_table 
CROSS JOIN generate_series(1, 100000) b_table;

Static starting point, row comparison, lower bound

EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5, 50000) < (a, b)
ORDER BY a, b
LIMIT 3;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..0.69 rows=3 width=8) (actual time=0.051..0.056 rows=3 loops=1)
   ->  Index Only Scan using foo_a_b on foo  (cost=0.43..31717.57 rows=367608 width=8) (actual time=0.049..0.052 rows=3 loops=1)
         Index Cond: (ROW(a, b) > ROW(5, 50000))
         Heap Fetches: 3
 Planning Time: 0.171 ms
 Execution Time: 0.113 ms

Lightning fast!

Static starting point, reversed row comparison, lower bound

EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (a, b) > (5, 50000)
ORDER BY a, b
LIMIT 3;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.51 rows=3 width=8) (actual time=0.048..0.051 rows=3 loops=1)
   ->  Index Only Scan using foo_a_b on foo  (cost=0.42..11350.75 rows=398533 width=8) (actual time=0.046..0.048 rows=3 loops=1)
         Index Cond: (ROW(a, b) > ROW(5, 50000))
         Heap Fetches: 0
 Planning Time: 0.169 ms
 Execution Time: 0.089 ms

Still lightning fast!

Static starting point, element, lower bound

EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5 = a AND 50000 < b) OR 5 < a
ORDER BY a, b
LIMIT 3;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.66 rows=3 width=8) (actual time=68.243..68.244 rows=3 loops=1)
   ->  Index Only Scan using foo_a_b on foo  (cost=0.42..33480.43 rows=428662 width=8) (actual time=68.241..68.242 rows=3 loops=1)
         Filter: (((5 = a) AND (50000 < b)) OR (5 < a))
         Rows Removed by Filter: 550000
         Heap Fetches: 0
 Planning Time: 0.216 ms
 Execution Time: 68.277 ms

Filter kills performance. Lesson learned: optimizer in to smart enough, so just stick to row comparisons.

Inline starting point, row comparison, lower bound

EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (SELECT ROW(a, b) FROM foo WHERE id = 550000) < (a, b)
ORDER BY a, b
LIMIT 3
;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8.87..9.12 rows=3 width=8) (actual time=137.267..137.268 rows=3 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using foo_pkey on foo foo_1  (cost=0.42..8.44 rows=1 width=32) (actual time=0.021..0.024 rows=1 loops=1)
           Index Cond: (id = 550000)
   ->  Index Only Scan using foo_a_b on foo  (cost=0.42..28480.42 rows=333333 width=8) (actual time=137.264..137.265 rows=3 loops=1)
         Filter: ($0 < ROW(a, b))
         Rows Removed by Filter: 550000
         Heap Fetches: 0
 Planning Time: 0.219 ms
 Execution Time: 137.310 ms

Filter kills performance.

Inline starting point, reversed row comparison, lower bound

EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (a, b) > (SELECT ROW(a, b) FROM foo WHERE id = 550000)
ORDER BY a, b
LIMIT 3


ERROR:  subquery has too few columns
LINE 3: WHERE  (a, b) > (SELECT ROW(a, b) FROM foo WHERE id = 550000)

Failure. A wat moment for me. Could be a bug.

Static starting point, row comparison, lower bound and upper bound

EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5, 50000) < (a, b) AND (a, b) < (440, 35)
ORDER BY a, b
LIMIT 3;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.52 rows=3 width=8) (actual time=0.055..0.058 rows=3 loops=1)
   ->  Index Only Scan using foo_a_b on foo  (cost=0.42..12347.08 rows=398533 width=8) (actual time=0.052..0.054 rows=3 loops=1)
         Index Cond: ((ROW(a, b) > ROW(5, 50000)) AND (ROW(a, b) < ROW(440, 35)))
         Heap Fetches: 0
 Planning Time: 0.224 ms
 Execution Time: 0.104 ms

Lightning fast still!

Static starting point, row comparison, fixed first column and lower bounded second column

EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5, 50000) < (a, b) AND a = 5
ORDER BY a, b
LIMIT 3;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.52 rows=3 width=8) (actual time=4.579..4.581 rows=3 loops=1)
   ->  Index Only Scan using foo_a_b on foo  (cost=0.42..1237.23 rows=39840 width=8) (actual time=4.576..4.578 rows=3 loops=1)
         Index Cond: ((ROW(a, b) > ROW(5, 50000)) AND (a = 5))
         Heap Fetches: 0
 Planning Time: 0.213 ms
 Execution Time: 4.622 ms

Much slower, but tolerable. Optimizer is smart enough to combine and equality and the row comparison together and to preserve the Index Cond.

Starting point from `WITH` clause, row comparison, fixed first column and lower bounded second column

EXPLAIN (ANALYZE)
WITH starting_point AS (SELECT ROW(a, b) r FROM foo WHERE id = 550000)
SELECT a, b
FROM foo
WHERE (SELECT r FROM starting_point) < (a, b) AND a = (SELECT a FROM starting_point)
ORDER BY a, b
LIMIT 3;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8.89..100.63 rows=3 width=8) (actual time=136.362..136.365 rows=3 loops=1)
   CTE starting_point
     ->  Index Scan using foo_pkey on foo foo_1  (cost=0.42..8.44 rows=1 width=32) (actual time=0.025..0.027 rows=1 loops=1)
           Index Cond: (id = 550000)
   InitPlan 2 (returns $1)
     ->  CTE Scan on starting_point  (cost=0.00..0.02 rows=1 width=32) (actual time=0.029..0.032 rows=1 loops=1)
   ->  Index Only Scan using foo_a_b on foo  (cost=0.42..50980.43 rows=1667 width=8) (actual time=136.359..136.361 rows=3 loops=1)
         Filter: (($1 < ROW(a, b)) AND (a = (SubPlan 3)))
         Rows Removed by Filter: 550000
         Heap Fetches: 0
         SubPlan 3
           ->  CTE Scan on starting_point starting_point_1  (cost=0.00..0.02 rows=1 width=4) (actual time=0.000..0.001 rows=1 loops=3)
 Planning Time: 0.311 ms
 Execution Time: 136.423 ms

This is the query that I want to run faster.

UPD: Slight improvement.


EXPLAIN (ANALYZE)
WITH starting_point AS (SELECT ROW(a, b) r, a FROM foo WHERE id = 550000)
SELECT a, b
FROM foo
WHERE (SELECT r FROM starting_point) < (a, b) AND a = (SELECT a FROM starting_point)
ORDER BY a, b
LIMIT 3;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8.91..9.19 rows=3 width=8) (actual time=41.532..41.535 rows=3 loops=1)
   CTE starting_point
     ->  Index Scan using foo_pkey on foo foo_1  (cost=0.42..8.44 rows=1 width=36) (actual time=0.036..0.038 rows=1 loops=1)
           Index Cond: (id = 550000)
   InitPlan 2 (returns $1)
     ->  CTE Scan on starting_point  (cost=0.00..0.02 rows=1 width=32) (actual time=0.001..0.002 rows=1 loops=1)
   InitPlan 3 (returns $2)
     ->  CTE Scan on starting_point starting_point_1  (cost=0.00..0.02 rows=1 width=4) (actual time=0.041..0.044 rows=1 loops=1)
   ->  Index Only Scan using foo_a_b on foo  (cost=0.42..3100.43 rows=33333 width=8) (actual time=41.529..41.530 rows=3 loops=1)
         Index Cond: (a = $2)
         Filter: ($1 < ROW(a, b))
         Rows Removed by Filter: 50000
         Heap Fetches: 0
 Planning Time: 0.314 ms
 Execution Time: 41.598 ms

For some reason this is somewhat faster.

Solution

Building on the answer of jjanes, here's what I ended up having:

EXPLAIN (ANALYZE)
WITH starting_point AS (SELECT a, b FROM foo WHERE id = 550000)
SELECT a, b
FROM foo
WHERE
  (a, b) > (SELECT * FROM starting_point)
  AND a <= (SELECT a FROM starting_point)
ORDER BY a, b
LIMIT 3;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8.91..9.00 rows=3 width=8) (actual time=0.096..0.100 rows=3 loops=1)
   CTE starting_point
     ->  Index Scan using foo_pkey on foo foo_1  (cost=0.42..8.44 rows=1 width=8) (actual time=0.023..0.025 rows=1 loops=1)
           Index Cond: (id = 550000)
   InitPlan 2 (returns $1,$2)
     ->  CTE Scan on starting_point  (cost=0.00..0.02 rows=1 width=8) (actual time=0.027..0.030 rows=1 loops=1)
   InitPlan 3 (returns $3)
     ->  CTE Scan on starting_point starting_point_1  (cost=0.00..0.02 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=1)
   ->  Index Only Scan using foo_a_b on foo  (cost=0.42..3442.65 rows=111111 width=8) (actual time=0.093..0.095 rows=3 loops=1)
         Index Cond: ((ROW(a, b) > ROW($1, $2)) AND (a <= $3))
         Heap Fetches: 0
 Planning Time: 0.566 ms
 Execution Time: 0.160 ms

Learnings:

Subqueries in the where clause must be on the right of the comparison for some reason, otherwise the optimizer gets confused.
ROW() should be avoided, otherwise the planner fails.
Both conditions must be inequalities, otherwise the performance suffers.