I need to process the id column (primary key, integer) of a table with millions of records (about 25M). However, I am only interested in every n-th id.
Currently a simple approach is used:
select id from big order by id;
and then every n-th id is processed by the client software (cursor based).
I was wondering whether this could be made much more efficient by delegating the selection of every n-th id to PostgreSQL. I tried this:
select id from
(select id, row_number() over (order by id) from big) _
where row_number % 10000 = 0;
However, this approach is much slower:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on _ (cost=0.44..1291886.20 rows=115413 width=4) (actual time=9.385..10731.994 rows=2308 loops=1)
Filter: ((_.row_number % '10000'::bigint) = 0)
Rows Removed by Filter: 23080220
-> WindowAgg (cost=0.44..945648.28 rows=23082528 width=12) (actual time=0.107..9450.396 rows=23082528 loops=1)
-> Index Only Scan using big_pkey on big (cost=0.44..599410.36 rows=23082528 width=4) (actual time=0.093..2403.921 rows=23082528 loops=1)
Heap Fetches: 0
Planning Time: 0.172 ms
Execution Time: 10732.229 ms
(8 rows)
The simple query has an execution time of 2721.101 ms (so almost 4 times faster).
Question: is there a better way to do this? (using PostgreSQL 11)
Create a cursor and fetch only every tenth row (also works for different intervals):
BEGIN; -- must be in a transaction
DECLARE cc CURSOR FOR
SELECT id FROM big ORDER BY id;
/* skip 9 rows */
MOVE 9 IN cc;
FETCH NEXT FROM cc;  /* returns the 10th row */
Continue performing the MOVE and FETCH in a loop until you run out of rows.
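If the per-row processing can happen inside the database, the same loop can also be run server-side in a PL/pgSQL DO block. A minimal sketch, assuming every 10th row and using an illustrative cursor name cur, variable v_id, and a RAISE NOTICE as a stand-in for the real processing:

DO $$
DECLARE
    cur CURSOR FOR SELECT id FROM big ORDER BY id;  -- same ordered index scan as above
    v_id integer;
BEGIN
    OPEN cur;
    LOOP
        MOVE FORWARD 9 FROM cur;      -- skip 9 rows
        FETCH cur INTO v_id;          -- read the 10th
        EXIT WHEN NOT FOUND;          -- stop when no rows are left
        RAISE NOTICE 'id = %', v_id;  -- placeholder for the actual processing
    END LOOP;
    CLOSE cur;
END;
$$;

With the client-side variant shown above, each MOVE/FETCH pair is a round trip; the DO block avoids that, but only makes sense if the processing does not have to happen in the client software.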