Search code examples
sqlpostgresqlconcurrencyparallel-processingsql-delete

How to parallelize SELECT and DELETE?


In PostgreSQL I have to fetch ids in small batches and delete them at same time, so that no 2 clients will fetch same ids.

Sadly, DELETE does not support LIMIT, so I came up with this query:

DELETE FROM codes
WHERE id IN (SELECT id FROM codes WHERE product = 'abc' LIMIT 100 FOR UPDATE)
RETURNING id

Sadly, this is slow and kind of serial algorithm (under certain circumstances) - you cannot SELECT while DELETE is running, not to select same id two times. And you cannot DELETE until you have SELECT result.

Which results in poor performance, when many clients came in parallel. What are my best options to parallelize/speedup it all?

PS: when product are different from different clients - database parallelizes it well, but when they are same - response time starts to grow linearly with amount of parallel clients requesting.


Solution

  • Assuming you don't care about the order in which codes are consumed - indicated by the absence of ORDER BY in your query.

    Use SKIP LOCKED to allow parallelization. Else, multiple concurrent calls regularly pile up, waiting for the previous one(s) to finish.

    Also, materialize your selection in a CTE, as the LIMIT clause does not interact well with a locking clause. See:

    WITH sel AS MATERIALIZED (
       SELECT id
       FROM   codes
       WHERE  product = 'abc'
       LIMIT  100
       FOR    UPDATE SKIP LOCKED
       )
    DELETE FROM codes c
    USING  sel s
    WHERE  c.id = s.id
    RETURNING c.id;
    

    The MATERIALIZED keyword is not strictly needed, as this CTE isn't going to be inlined anyway. But it won't hurt for clarity.

    Now you can run multiple instances of this command, each in a separate session, concurrently and safely.

    To make sure all rows have been processed, check after you think you are done. Like:

    SELECT EXISTS (SELECT FROM codes WHERE product = 'abc');
    

    If there are still rows left, run a final command without SKIP LOCKED. Loop the last two steps, to be absolutely sure. Or just loop the command without SKIP LOCKED.

    An index on codes(product) should help if the filter WHERE product = 'abc' is actually selective, i.e. only a very small percentage of total rows. Else, it's typically more cost than help for this, as the index adds write cost.