I have a Postgres table with several billion rows and, for a machine learning application, need to divide it into train and test sets. I want the test rows to be mostly contiguous by their id column, so I would like to randomly select several chunks of 1,000 contiguous rows each and mark them as test rows. I have an index on the id column, so selecting any arbitrary 1,000 contiguous rows is fast:
UPDATE tbl SET test=true WHERE id BETWEEN 100000 AND 101000;
is very efficient and uses an index scan, as you would expect. Unfortunately, as soon as the starting id is generated randomly, i.e.
WITH off AS (SELECT ROUND(random()*maxId))
UPDATE tbl SET test=true
WHERE id BETWEEN (SELECT * FROM off LIMIT 1)
AND (SELECT * FROM off LIMIT 1)+1000;
The query planner now decides to do a full table scan (much slower).
Of course, if I only had to do this once, I would just generate a random starting id manually, no problem. However, in the end I want a function that automatically divides the table into test and train sets, like the one below:
CREATE OR REPLACE FUNCTION test_train_divide(chunkSize integer, proportion real)
RETURNS BOOLEAN
AS $$
DECLARE
    maxId BIGINT := (SELECT MAX(id) FROM tbl);  -- bigint: with billions of rows, max(id) can exceed the integer range
BEGIN
    FOR i IN 1 .. round(maxId*proportion/chunkSize) LOOP
        RAISE NOTICE 'Update call %', i;
        -- mark one random chunk of contiguous ids as test rows
        WITH off AS (SELECT ROUND(random()*maxId))
        UPDATE tbl SET test=true
        WHERE id BETWEEN (SELECT * FROM off LIMIT 1)
        AND (SELECT * FROM off LIMIT 1)+chunkSize;
    END LOOP;
    RETURN true;
END;
$$ LANGUAGE plpgsql;
SELECT test_train_divide(1000,0.01);
This works, but is insanely slow! Any pointers?
Here is the schema:
Table "public.tbl"
Column | Type | Modifiers
-----------+---------+-----------
subid | integer |
id | bigint |
wordid | integer |
capid | integer |
test | boolean |
Indexes:
“tbl_id_idx" btree (id)
And here are two different query plans, one good (using the index) and one bad:
will=# EXPLAIN UPDATE tbl SET test=true WHERE id BETWEEN 1000000 AND 1001000;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Update on tbl (cost=0.57..790.45 rows=1079 width=38)
-> Index Scan using tbl_id_idx on tbl (cost=0.57..790.45 rows=1079 width=38)
Index Cond: ((id >= 1000000) AND (id <= 1001000))
(3 rows)
will=# EXPLAIN WITH start AS (SELECT round(random()*max(id)) FROM tbl) UPDATE tbl c SET test=true WHERE c.id BETWEEN (SELECT * FROM start LIMIT 1) AND (SELECT * FROM start LIMIT 1)+1000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Update on tbl c (cost=0.65..14932243.97 rows=1459961 width=38)
CTE start
-> Result (cost=0.59..0.61 rows=1 width=0)
InitPlan 1 (returns $0)
-> Limit (cost=0.57..0.59 rows=1 width=8)
-> Index Only Scan Backward using tbl_id_idx on tbl (cost=0.57..5846291.90 rows=288468819 width=8)
Index Cond: (id IS NOT NULL)
InitPlan 3 (returns $2)
-> Limit (cost=0.00..0.02 rows=1 width=8)
-> CTE Scan on start (cost=0.00..0.02 rows=1 width=8)
InitPlan 4 (returns $3)
-> Limit (cost=0.00..0.02 rows=1 width=8)
-> CTE Scan on start start_1 (cost=0.00..0.02 rows=1 width=8)
-> Seq Scan on tbl c (cost=0.00..14932243.32 rows=1459961 width=38)
Filter: (((id)::double precision >= $2) AND ((id)::double precision <= ($3 + 1000::double precision)))
(15 rows)
Time: 2.649 ms
After initializing max_id as max(id) - 1000 to leave room for 1000 rows, this should be using the index:
UPDATE tbl
SET test = true
FROM (SELECT (random() * max_id)::bigint AS lower_bound) t
WHERE id BETWEEN t.lower_bound AND t.lower_bound + 999;
No need for the complicated structure with a CTE and subqueries. Use a single subquery.
Your original calculation yields a numeric (or dp), which may not go well with an index on a bigint column. Cast to bigint. (Should not be a problem in pg 9.3.)
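For illustration, the difference is only where the cast lands (the bounds below are placeholders, not values from your data): the bad plan above filters on (id)::double precision, so the btree index cannot be used, while a bigint bound keeps the comparison on the indexed column:
EXPLAIN UPDATE tbl SET test=true WHERE id::double precision BETWEEN 1000000.0 AND 1001000.0;  -- cast lands on id: cannot use tbl_id_idx
EXPLAIN UPDATE tbl SET test=true WHERE id BETWEEN 1000000::bigint AND 1001000::bigint;        -- sargable: can use tbl_id_idx, cf. the good plan above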
BETWEEN includes both the lower and the upper bound. Strictly speaking, your upper bound should be lower_bound + 999.
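A quick way to see the off-by-one, using generate_series as a stand-in for gapless ids:
SELECT count(*) FROM generate_series(100000, 101000);  -- returns 1001, not 1000: both bounds are included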
random() returns (per documentation) a random value in the range 0.0 <= x < 1.0. To be completely fair, your lower_bound should really be calculated like this (assuming no gaps):
trunc(random() * max_id)::bigint + 1
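Putting these pieces together, your function could be rewritten roughly like this. This is only a sketch: it assumes the real table tbl from the schema above, gapless ids, and parameter/variable names of my choosing (_chunk_size, _proportion, max_id):
CREATE OR REPLACE FUNCTION test_train_divide(_chunk_size integer, _proportion real)
RETURNS boolean AS $$
DECLARE
    max_id bigint := (SELECT max(id) - _chunk_size FROM tbl);  -- leave room for one full chunk
BEGIN
    FOR i IN 1 .. round(max_id * _proportion / _chunk_size) LOOP
        RAISE NOTICE 'Update call %', i;
        UPDATE tbl
        SET    test = true
        FROM  (SELECT trunc(random() * max_id)::bigint + 1 AS lower_bound) t  -- one-row subquery: bigint bound, random() evaluated once
        WHERE  id BETWEEN t.lower_bound AND t.lower_bound + _chunk_size - 1;  -- inclusive bounds: exactly _chunk_size ids
    END LOOP;
    RETURN true;
END;
$$ LANGUAGE plpgsql;

SELECT test_train_divide(1000, 0.01);  -- same call as before
Each UPDATE should now run as a parameterized index scan on tbl_id_idx instead of a seq scan.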
If you need truly random numbers (or if your id has gaps), consider this related answer:
Advisory locks or a different approach may also be useful; compare this related, later answer:
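If gaps are the only concern, one workaround (my sketch, not necessarily what the linked answers describe) is to jump to a random point in the id range and take the first existing id at or after it; this stays index-friendly, although ids that follow large gaps are picked more often:
SELECT id AS lower_bound
FROM   tbl
WHERE  id >= (SELECT trunc(random() * max(id))::bigint + 1 FROM tbl)
ORDER  BY id
LIMIT  1;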