I want to implement efficient pagination in Postgresql (14.5): give me a few rows associated with some a
, that come immediately after some b
. Unfortunately Index Cond
disappears from under the Index Only Scan
as soon as I try to fetch the a
and b
values from the database itself. Is there a way to avoid it?
The last query in this long list is doing what I want, but slowly. Everything else is presented for completeness to showcase different aspects of the problem.
CREATE TABLE foo (id SERIAL PRIMARY KEY, a INT NOT NULL, b INT NOT NULL);
CREATE UNIQUE INDEX foo_a_b ON foo (a, b);
INSERT INTO foo (a, b)
SELECT *
FROM generate_series(0, 9) a_table
CROSS JOIN generate_series(1, 100000) b_table;
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5, 50000) < (a, b)
ORDER BY a, b
LIMIT 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..0.69 rows=3 width=8) (actual time=0.051..0.056 rows=3 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.43..31717.57 rows=367608 width=8) (actual time=0.049..0.052 rows=3 loops=1)
Index Cond: (ROW(a, b) > ROW(5, 50000))
Heap Fetches: 3
Planning Time: 0.171 ms
Execution Time: 0.113 ms
Lightning fast!
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (a, b) > (5, 50000)
ORDER BY a, b
LIMIT 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.51 rows=3 width=8) (actual time=0.048..0.051 rows=3 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..11350.75 rows=398533 width=8) (actual time=0.046..0.048 rows=3 loops=1)
Index Cond: (ROW(a, b) > ROW(5, 50000))
Heap Fetches: 0
Planning Time: 0.169 ms
Execution Time: 0.089 ms
Still lightning fast!
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5 = a AND 50000 < b) OR 5 < a
ORDER BY a, b
LIMIT 3;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.66 rows=3 width=8) (actual time=68.243..68.244 rows=3 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..33480.43 rows=428662 width=8) (actual time=68.241..68.242 rows=3 loops=1)
Filter: (((5 = a) AND (50000 < b)) OR (5 < a))
Rows Removed by Filter: 550000
Heap Fetches: 0
Planning Time: 0.216 ms
Execution Time: 68.277 ms
Filter kills performance. Lesson learned: optimizer in to smart enough, so just stick to row comparisons.
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (SELECT ROW(a, b) FROM foo WHERE id = 550000) < (a, b)
ORDER BY a, b
LIMIT 3
;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8.87..9.12 rows=3 width=8) (actual time=137.267..137.268 rows=3 loops=1)
InitPlan 1 (returns $0)
-> Index Scan using foo_pkey on foo foo_1 (cost=0.42..8.44 rows=1 width=32) (actual time=0.021..0.024 rows=1 loops=1)
Index Cond: (id = 550000)
-> Index Only Scan using foo_a_b on foo (cost=0.42..28480.42 rows=333333 width=8) (actual time=137.264..137.265 rows=3 loops=1)
Filter: ($0 < ROW(a, b))
Rows Removed by Filter: 550000
Heap Fetches: 0
Planning Time: 0.219 ms
Execution Time: 137.310 ms
Filter kills performance.
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (a, b) > (SELECT ROW(a, b) FROM foo WHERE id = 550000)
ORDER BY a, b
LIMIT 3
ERROR: subquery has too few columns
LINE 3: WHERE (a, b) > (SELECT ROW(a, b) FROM foo WHERE id = 550000)
Failure. A wat moment for me. Could be a bug.
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5, 50000) < (a, b) AND (a, b) < (440, 35)
ORDER BY a, b
LIMIT 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.52 rows=3 width=8) (actual time=0.055..0.058 rows=3 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..12347.08 rows=398533 width=8) (actual time=0.052..0.054 rows=3 loops=1)
Index Cond: ((ROW(a, b) > ROW(5, 50000)) AND (ROW(a, b) < ROW(440, 35)))
Heap Fetches: 0
Planning Time: 0.224 ms
Execution Time: 0.104 ms
Lightning fast still!
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5, 50000) < (a, b) AND a = 5
ORDER BY a, b
LIMIT 3;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.52 rows=3 width=8) (actual time=4.579..4.581 rows=3 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..1237.23 rows=39840 width=8) (actual time=4.576..4.578 rows=3 loops=1)
Index Cond: ((ROW(a, b) > ROW(5, 50000)) AND (a = 5))
Heap Fetches: 0
Planning Time: 0.213 ms
Execution Time: 4.622 ms
Much slower, but tolerable. Optimizer is smart enough to combine and equality and the row comparison together and to preserve the Index Cond
.
WITH
clause, row comparison, fixed first column and lower bounded second columnEXPLAIN (ANALYZE)
WITH starting_point AS (SELECT ROW(a, b) r FROM foo WHERE id = 550000)
SELECT a, b
FROM foo
WHERE (SELECT r FROM starting_point) < (a, b) AND a = (SELECT a FROM starting_point)
ORDER BY a, b
LIMIT 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8.89..100.63 rows=3 width=8) (actual time=136.362..136.365 rows=3 loops=1)
CTE starting_point
-> Index Scan using foo_pkey on foo foo_1 (cost=0.42..8.44 rows=1 width=32) (actual time=0.025..0.027 rows=1 loops=1)
Index Cond: (id = 550000)
InitPlan 2 (returns $1)
-> CTE Scan on starting_point (cost=0.00..0.02 rows=1 width=32) (actual time=0.029..0.032 rows=1 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..50980.43 rows=1667 width=8) (actual time=136.359..136.361 rows=3 loops=1)
Filter: (($1 < ROW(a, b)) AND (a = (SubPlan 3)))
Rows Removed by Filter: 550000
Heap Fetches: 0
SubPlan 3
-> CTE Scan on starting_point starting_point_1 (cost=0.00..0.02 rows=1 width=4) (actual time=0.000..0.001 rows=1 loops=3)
Planning Time: 0.311 ms
Execution Time: 136.423 ms
This is the query that I want to run faster.
EXPLAIN (ANALYZE)
WITH starting_point AS (SELECT ROW(a, b) r, a FROM foo WHERE id = 550000)
SELECT a, b
FROM foo
WHERE (SELECT r FROM starting_point) < (a, b) AND a = (SELECT a FROM starting_point)
ORDER BY a, b
LIMIT 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8.91..9.19 rows=3 width=8) (actual time=41.532..41.535 rows=3 loops=1)
CTE starting_point
-> Index Scan using foo_pkey on foo foo_1 (cost=0.42..8.44 rows=1 width=36) (actual time=0.036..0.038 rows=1 loops=1)
Index Cond: (id = 550000)
InitPlan 2 (returns $1)
-> CTE Scan on starting_point (cost=0.00..0.02 rows=1 width=32) (actual time=0.001..0.002 rows=1 loops=1)
InitPlan 3 (returns $2)
-> CTE Scan on starting_point starting_point_1 (cost=0.00..0.02 rows=1 width=4) (actual time=0.041..0.044 rows=1 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..3100.43 rows=33333 width=8) (actual time=41.529..41.530 rows=3 loops=1)
Index Cond: (a = $2)
Filter: ($1 < ROW(a, b))
Rows Removed by Filter: 50000
Heap Fetches: 0
Planning Time: 0.314 ms
Execution Time: 41.598 ms
For some reason this is somewhat faster.
Building on the answer of jjanes, here's what I ended up having:
EXPLAIN (ANALYZE)
WITH starting_point AS (SELECT a, b FROM foo WHERE id = 550000)
SELECT a, b
FROM foo
WHERE
(a, b) > (SELECT * FROM starting_point)
AND a <= (SELECT a FROM starting_point)
ORDER BY a, b
LIMIT 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8.91..9.00 rows=3 width=8) (actual time=0.096..0.100 rows=3 loops=1)
CTE starting_point
-> Index Scan using foo_pkey on foo foo_1 (cost=0.42..8.44 rows=1 width=8) (actual time=0.023..0.025 rows=1 loops=1)
Index Cond: (id = 550000)
InitPlan 2 (returns $1,$2)
-> CTE Scan on starting_point (cost=0.00..0.02 rows=1 width=8) (actual time=0.027..0.030 rows=1 loops=1)
InitPlan 3 (returns $3)
-> CTE Scan on starting_point starting_point_1 (cost=0.00..0.02 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..3442.65 rows=111111 width=8) (actual time=0.093..0.095 rows=3 loops=1)
Index Cond: ((ROW(a, b) > ROW($1, $2)) AND (a <= $3))
Heap Fetches: 0
Planning Time: 0.566 ms
Execution Time: 0.160 ms
Learnings:
ROW()
should be avoided, otherwise the planner fails.