I have the following postgres query that's meant to update all message entries:
update message m set user_id = u.id from "user" u where u.user_pgid = m.user_pgid
The trouble is that I have ~300,000,000 message records, and postgres seems to be using only one core for this massive update, causing an IO timeout before the operation can complete.
How do I speed up this simple update command to make it as fast as possible?
The type in the first clause is UUID, and in the second it is a 64-bit integer (i64).
Postgres only uses parallel plans for SELECT (and some commands based on a SELECT, like CREATE TABLE AS). Queries modifying any rows cannot be parallelized. The manual:
Even when it is in general possible for parallel query plans to be generated, the planner will not generate them for a given query if any of the following are true:
- The query writes any data or locks any database rows. If a query contains a data-modifying operation either at the top level or within a CTE, no parallel plans for that query will be generated.
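You can verify this on your own system: the plan for the UPDATE contains no Gather / parallel nodes (a quick check only, not part of the fix; EXPLAIN does not execute the statement):
EXPLAIN (COSTS OFF)
UPDATE message m
SET    user_id = u.id
FROM   "user" u
WHERE  u.user_pgid = m.user_pgid;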
To address what you asked: split up the table into N non-overlapping slices along user_pgid, roughly equal in size, where N is the number of processes to employ in parallel. N should not exceed the number of CPU cores available, and in total, stay below the I/O capacity of your DB server.
Create a PROCEDURE like:
CREATE OR REPLACE PROCEDURE public.f_upd_message(
_lower_incl bigint
, _upper_excl bigint
, _step int = 50000
)
LANGUAGE plpgsql AS
$proc$
DECLARE
_low bigint;
_upd_ct int;
BEGIN
IF _upper_excl <= _lower_incl OR _step < 1 THEN
RAISE EXCEPTION '_upper_excl must be > _lower_incl & _step > 0! Was: _lower_incl: %, _upper_excl: %, _step: %'
, _lower_incl, _upper_excl, _step;
END IF;
FOR _low IN _lower_incl .. _upper_excl - 1 BY _step
LOOP
RAISE NOTICE 'user_pgid >= % AND user_pgid < %'
, _low, LEAST(_upper_excl, _low + _step); -- optional
UPDATE public.message m
SET user_id = u.id
FROM public."user" u
WHERE m.user_pgid >= _low
AND m.user_pgid < _low + _step
AND m.user_pgid < _upper_excl -- enforce upper bound
AND u.user_pgid = m.user_pgid
AND m.user_id <> u.id; -- ① suppress empty updates
GET DIAGNOSTICS _upd_ct = ROW_COUNT; -- optional
COMMIT;
RAISE NOTICE 'Updated % rows', _upd_ct; -- optional
END LOOP;
END
$proc$;
Call:
CALL public.f_upd_message(20000000, 30000000);
Or:
CALL public.f_upd_message(20000000, 30000000, 100000);
① Avoid empty updates (where the column value wouldn't change). If user_id can be NULL, use the null-safe comparison IS DISTINCT FROM instead.
This also prevents repeated updates in case you have to start over or mess up slices.
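To illustrate why (a standalone sketch): <> yields NULL when one operand is NULL, so such rows would be filtered out and never updated, while IS DISTINCT FROM treats NULL as a distinct value:
SELECT NULL <> 1               AS not_equal   -- NULL (row would be skipped)
     , NULL IS DISTINCT FROM 1 AS null_safe;  -- true (row still gets updated)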
Base your slices on the actual min and max user_pgid:
SELECT min(user_pgid) AS _lower_incl, max(user_pgid) + 1 AS _upper_excl
FROM public.message;
Adjust the _step size to your system. I added a default of 50000.
Then run N separate sessions each processing one slice. Like, start N psql instances (manually or in a shell script).
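For example, with three slices covering the whole range (the boundary values here are made up - derive yours from the min/max query above), each CALL runs in its own psql session, all at the same time:
CALL public.f_upd_message(        1, 100000001);  -- session 1
CALL public.f_upd_message(100000001, 200000001);  -- session 2
CALL public.f_upd_message(200000001, 300000001);  -- session 3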
Each step is committed. If you should still run into a timeout (or any other problems), committed work is not rolled back.
The table will grow up to twice its size because every update leaves a dead tuple behind. You may want to run VACUUM FULL afterwards - if you can afford to do so. Alternatively, issue VACUUM at reasonable intervals between batches (from a separate session; VACUUM cannot be executed from within the procedure) to make space from dead tuples available for reuse.
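The cleanup afterwards could look like this (note that VACUUM FULL rewrites the whole table and holds an exclusive lock while it runs):
VACUUM FULL ANALYZE public.message;  -- reclaims dead-tuple space, refreshes statistics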
Various other optimizations are possible, like dropping and later recreating FK constraints, indexes, etc. But you absolutely need an index on message(user_pgid) for this!
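If that index is missing, create it first (the index name is just an assumption):
-- lets each batch find its rows without scanning the whole table
CREATE INDEX IF NOT EXISTS message_user_pgid_idx ON public.message (user_pgid);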
If you are at liberty to do so, create an updated (sorted?) copy of the table instead of updating all rows. Like Frank already suggested. That gives you a pristine (clustered) table without bloat.
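A minimal sketch of that alternative (the table name message_new, the explicit column list, and the ORDER BY are assumptions - adapt to your schema, then recreate indexes, constraints, and privileges before swapping the tables):
CREATE TABLE public.message_new AS
SELECT m.user_pgid, u.id AS user_id  -- plus all other message columns
FROM   public.message m
JOIN   public."user" u ON u.user_pgid = m.user_pgid
ORDER  BY m.user_pgid;               -- optional: physically cluster the copy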