Tags: sql, postgresql, indexing, random, postgresql-performance

Efficiently mark contiguous subsets of rows in postgres


I have a Postgres table with several billion rows, and for a machine learning application I need to divide it into train and test sets.

I want the test rows to be mostly contiguous by their id column, so I would like to randomly select several chunks of 1,000 contiguous rows each and mark them as test rows. I have an index on the id column, so selecting any arbitrary 1,000 contiguous rows is fast:

UPDATE table SET test=true WHERE id BETWEEN 100000 AND 101000;

is very efficient and uses an index scan, as you would expect. Unfortunately, as soon as I generate the starting id randomly, i.e.

WITH off AS (SELECT ROUND(random()*maxId))
  UPDATE table SET test=true
    WHERE id BETWEEN (SELECT * FROM off LIMIT 1)
                 AND (SELECT * FROM off LIMIT 1)+1000;

The query planner now decides to do a full table scan (much slower).

Of course, if I only had to do this once, I would just generate a random starting id manually, no problem. However, in the end I want a function that automatically divides the table into test and train sets, like the one below:

CREATE OR REPLACE FUNCTION test_train_divide(chunkSize integer, proportion real)
RETURNS BOOLEAN
AS $$
DECLARE
maxId INTEGER := (SELECT MAX(id) FROM table);
BEGIN
FOR i IN 1 .. round(maxId*proportion/chunkSize) LOOP
  RAISE NOTICE 'Update call %', i;
  WITH off AS (SELECT ROUND(random()*maxId))
  UPDATE table SET test=true
    WHERE id BETWEEN (SELECT * FROM off LIMIT 1)
                 AND (SELECT * FROM off LIMIT 1)+chunkSize;
END LOOP;
return true;
END;
$$ LANGUAGE plpgsql;

SELECT test_train_divide(1000,0.01);

This works, but is insanely slow! Any pointers?

Update

Here is the schema

    tbl "public.tbl”
  Column   |  Type   | Modifiers 
-----------+---------+-----------
 subid     | integer | 
 id        | bigint  | 
 wordid    | integer | 
 capid     | integer | 
 test      | boolean | 
Indexes:
    "tbl_id_idx" btree (id)

And here are two different query plans, one good (using the index) and one bad:

will=# EXPLAIN UPDATE tbl SET test=true WHERE id BETWEEN 1000000 AND 1001000;

                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Update on tbl  (cost=0.57..790.45 rows=1079 width=38)
   ->  Index Scan using tbl_id_idx on tbl  (cost=0.57..790.45 rows=1079 width=38)
         Index Cond: ((id >= 1000000) AND (id <= 1001000))
(3 rows)


will=# EXPLAIN WITH start AS (SELECT round(random()*max(id)) FROM tbl) UPDATE tbl c SET test=true WHERE c.id BETWEEN (SELECT * FROM start LIMIT 1) AND (SELECT * FROM start LIMIT 1)+1000;



                                                             QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Update on tbl c  (cost=0.65..14932243.97 rows=1459961 width=38)
   CTE start
     ->  Result  (cost=0.59..0.61 rows=1 width=0)
           InitPlan 1 (returns $0)
             ->  Limit  (cost=0.57..0.59 rows=1 width=8)
                   ->  Index Only Scan Backward using tbl_id_idx on tbl  (cost=0.57..5846291.90 rows=288468819 width=8)
                         Index Cond: (id IS NOT NULL)
   InitPlan 3 (returns $2)
     ->  Limit  (cost=0.00..0.02 rows=1 width=8)
           ->  CTE Scan on start  (cost=0.00..0.02 rows=1 width=8)
   InitPlan 4 (returns $3)
     ->  Limit  (cost=0.00..0.02 rows=1 width=8)
           ->  CTE Scan on start start_1  (cost=0.00..0.02 rows=1 width=8)
   ->  Seq Scan on tbl c  (cost=0.00..14932243.32 rows=1459961 width=38)
         Filter: (((id)::double precision >= $2) AND ((id)::double precision <= ($3 + 1000::double precision)))
(15 rows)

Time: 2.649 ms

Solution

  • After initializing max_id as max(id) - 1000 to leave room for 1000 rows, this should be using the index:

    UPDATE table
    SET    test = true
    FROM  (SELECT (random() * max_id)::bigint AS lower_bound) t
    WHERE  id BETWEEN t.lower_bound AND t.lower_bound + 999;
    
    • No need for the complicated structure with a CTE and subqueries. Use a single subquery.

    • Your original calculation yields a numeric (or double precision) value, which does not go well with an index on a bigint column: as the bad plan shows, id itself gets cast to double precision in the filter, so the index cannot be used. Cast to bigint instead. (Should not be a problem in pg 9.3.)

    • BETWEEN includes both lower and upper bound. Strictly speaking, your upper bound should be lower_bound + 999 to cover exactly 1,000 ids.

    • random() returns (per documentation) a random value in the range 0.0 <= x < 1.0. To be completely fair, your lower_bound should really be calculated like this (assuming no gaps):

        trunc(random() * max_id)::bigint + 1
      

    If you need truly random numbers (or if your id has gaps), consider this related answer.

    Advisory locks or a different approach may also be useful; compare this related, later answer. A rewrite of the original function that applies the points above is sketched below.
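
    Putting these points together, the function from the question might be rewritten roughly as follows. This is only a sketch, assuming the tbl schema shown above and an id column that starts at 1 without large gaps: the random lower bound is computed into a bigint variable first, so the range comparison stays directly on id and the btree index can be used.

        CREATE OR REPLACE FUNCTION test_train_divide(chunkSize integer, proportion real)
        RETURNS boolean AS $$
        DECLARE
           -- subtract chunkSize so a full chunk always fits above the random lower bound
           maxId bigint := (SELECT max(id) FROM tbl) - chunkSize;
           lower_bound bigint;
        BEGIN
        FOR i IN 1 .. round(maxId * proportion / chunkSize) LOOP
           RAISE NOTICE 'Update call %', i;
           -- bigint bound, so id is not cast to double precision in the comparison
           lower_bound := trunc(random() * maxId)::bigint + 1;
           UPDATE tbl
           SET    test = true
           WHERE  id BETWEEN lower_bound AND lower_bound + chunkSize - 1;
        END LOOP;
        RETURN true;
        END;
        $$ LANGUAGE plpgsql;

        SELECT test_train_divide(1000, 0.01);

    Each iteration still marks one random chunk of chunkSize contiguous ids, so roughly a proportion of all rows end up flagged as test rows (chunks can occasionally overlap, just as in the original version).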