I have the following postgres query that's meant to update all message entries:
update message m set user_id = u.id from "user" u where u.user_pgid = m.user_pgid
The trouble is that I have ~300,000,000 message records, and postgres seems to be using only one core for this massive update, causing an IO timeout before the operation can complete.
How do I speed up this simple update command to make it as fast as possible?
The type in the first clause is UUID, and in the second it is a 64-bit integer (i64).
Postgres only uses parallel plans for SELECT (and some commands based on a SELECT, like CREATE TABLE AS). Queries modifying any rows cannot be parallelized. The manual:
Even when it is in general possible for parallel query plans to be generated, the planner will not generate them for a given query if any of the following are true:
- The query writes any data or locks any database rows. If a query contains a data-modifying operation either at the top level or within a CTE, no parallel plans for that query will be generated.
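You can verify this on your own system: the plan for the UPDATE contains no Gather / parallel nodes (a quick check only, not part of the fix; EXPLAIN does not execute the statement):
EXPLAIN (COSTS OFF)
UPDATE message m
SET    user_id = u.id
FROM   "user" u
WHERE  u.user_pgid = m.user_pgid;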
To address what you asked: split up the table into N non-overlapping slices along user_pgid, roughly equal in size, where N is the number of processes to employ in parallel. N should not exceed the number of CPU cores available, and in total, stay below the I/O capacity of your DB server.
Create a PROCEDURE like:
CREATE OR REPLACE PROCEDURE public.f_upd_message(
_lower_incl bigint
, _upper_excl bigint
, _step int = 50000
)
LANGUAGE plpgsql AS
$proc$
DECLARE
_low bigint;
_upd_ct int;
BEGIN
IF _upper_excl <= _lower_incl OR _step < 1 THEN
RAISE EXCEPTION '_upper_excl must be > _lower_incl & _step > 0! Was: _lower_incl: %, _upper_excl: %, _step: %'
, _lower_incl, _upper_excl, _step;
END IF;
FOR _low IN _lower_incl .. _upper_excl - 1 BY _step
LOOP
RAISE NOTICE 'user_pgid >= % AND user_pgid < %'
, _low, LEAST(_upper_excl, _low + _step); -- optional
UPDATE public.message m
SET user_id = u.id
FROM public."user" u
WHERE m.user_pgid >= _low
AND m.user_pgid < _low + _step
AND m.user_pgid < _upper_excl -- enforce upper bound
AND u.user_pgid = m.user_pgid
AND m.user_id <> u.id; -- ① suppress empty updates
GET DIAGNOSTICS _upd_ct = ROW_COUNT; -- optional
COMMIT;
RAISE NOTICE 'Updated % rows', _upd_ct; -- optional
END LOOP;
END
$proc$;
Call:
CALL public.f_upd_message(20000000, 30000000);
Or:
CALL public.f_upd_message(20000000, 30000000, 100000);
① Avoid empty updates (where the column value wouldn't change). If user_id can be NULL, use the null-safe comparison IS DISTINCT FROM instead.
This also prevents repeated updates in case you have to start over or mess up slices.
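To illustrate why (a standalone sketch): <> yields NULL when one operand is NULL, so such rows would be filtered out and never updated, while IS DISTINCT FROM treats NULL as a distinct value:
SELECT NULL <> 1               AS not_equal   -- NULL (row would be skipped)
     , NULL IS DISTINCT FROM 1 AS null_safe;  -- true (row still gets updated)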
Base your slices on the actual min and max user_pgid:
SELECT min(user_pgid) AS _lower_incl, max(user_pgid) + 1 AS _upper_excl
FROM public.message;
Adjust the _step size to your system. I added a default of 50000.
Then run N separate sessions each processing one slice. Like, start N psql instances (manually or in a shell script).
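For example, with three slices covering the whole range (the boundary values here are made up - derive yours from the min/max query above), each CALL runs in its own psql session, all at the same time:
CALL public.f_upd_message(        1, 100000001);  -- session 1
CALL public.f_upd_message(100000001, 200000001);  -- session 2
CALL public.f_upd_message(200000001, 300000001);  -- session 3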
Each step is committed. If you should still run into a timeout (or any other problems), committed work is not rolled back.
The table will grow up to twice its size because every update leaves a dead tuple behind. You may want to run VACUUM FULL afterwards - if you can afford to do so. Alternatively, issue VACUUM at reasonable intervals between batches (from a separate session; VACUUM cannot be executed from within the procedure) to make space from dead tuples available for reuse.
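The cleanup afterwards could look like this (note that VACUUM FULL rewrites the whole table and holds an exclusive lock while it runs):
VACUUM FULL ANALYZE public.message;  -- reclaims dead-tuple space, refreshes statistics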
Various other optimizations are possible, like dropping and later recreating FK constraints, indexes, etc. But you absolutely need an index on message(user_pgid) for this!
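If that index is missing, create it first (the index name is just an assumption):
-- lets each batch find its rows without scanning the whole table
CREATE INDEX IF NOT EXISTS message_user_pgid_idx ON public.message (user_pgid);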
If you are at liberty to do so, create an updated (sorted?) copy of the table instead of updating all rows. Like Frank already suggested. That gives you a pristine (clustered) table without bloat.
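A minimal sketch of that alternative (the table name message_new, the explicit column list, and the ORDER BY are assumptions - adapt to your schema, then recreate indexes, constraints, and privileges before swapping the tables):
CREATE TABLE public.message_new AS
SELECT m.user_pgid, u.id AS user_id  -- plus all other message columns
FROM   public.message m
JOIN   public."user" u ON u.user_pgid = m.user_pgid
ORDER  BY m.user_pgid;               -- optional: physically cluster the copy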