I recently added a new column to my 40+ million row Postgres table (v9.6) with
ALTER TABLE queries
ADD COLUMN ml_partition_source UUID;
Then in another transaction I executed
ALTER TABLE queries
ALTER COLUMN ml_partition_source
SET DEFAULT public.gen_random_uuid();
I did this in two transactions because, on 9.6, adding a new column with a default in a single statement causes Postgres to rewrite the whole table, which can take hours and isn't acceptable in production.
Now I'd like to backfill this column for all of the queries that existed before the new column was added, without locking the table. One way to do this would be through a CRUD API I have, but some rough calculations show that this would take ~22 days (maybe my API performance can be improved, but that's a whole different question). Instead, I tried writing a Postgres function:
CREATE OR REPLACE FUNCTION backfill_partition_source()
RETURNS void AS $$
DECLARE
    query_ record;
BEGIN
    FOR query_ IN
        SELECT * FROM api_mldata.queries WHERE ml_partition_source IS NULL
    LOOP
        UPDATE api_mldata.queries
        SET ml_partition_source = public.gen_random_uuid()
        WHERE id = query_.id;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
and executed it with select backfill_partition_source(). But that ended up locking the table too.
How can I backfill a column without impacting production (or with minimal production impact)?
EDIT: One idea I have is "chunking" the Postgres script to operate on 100k rows at a time or so, and then executing the script in a loop. So the select statement would become:
select * from api_mldata.queries
where ml_partition_source is null
limit 100000;
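A rough sketch of what I have in mind, assuming each batch is issued as its own statement (e.g. from psql with autocommit on, rather than inside a single function call), repeated until it reports UPDATE 0:
-- one batch; each run commits and releases its locks on its own
UPDATE api_mldata.queries
SET ml_partition_source = public.gen_random_uuid()
WHERE id IN (
    SELECT id
    FROM api_mldata.queries
    WHERE ml_partition_source IS NULL
    LIMIT 100000
);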
You cannot get away without locking at all, but you can keep the locks reasonably short.
Rather than running many single-row updates in a loop, run bigger updates:
UPDATE api_mldata.queries
SET ml_partition_source = DEFAULT
WHERE id BETWEEN 1 AND 999999;
Here id is the primary key of the table.
That way you can be done with a few bigger updates, each for a different range of ids.
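To work out the ranges, something like this (assuming id is a sequential integer key, as above) gives the bounds to split into batches:
SELECT min(id), max(id) FROM api_mldata.queries;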
To avoid bloat and excessive locking, run each statement in its own transaction and launch an explicit VACUUM on the table between statements.
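Put together, a sketch of the whole backfill run from psql with autocommit on, so that each statement is its own transaction (the ranges here are only illustrative):
UPDATE api_mldata.queries
SET ml_partition_source = DEFAULT
WHERE id BETWEEN 1 AND 999999;

VACUUM api_mldata.queries;  -- reclaim the dead row versions left by the update

UPDATE api_mldata.queries
SET ml_partition_source = DEFAULT
WHERE id BETWEEN 1000000 AND 1999999;

VACUUM api_mldata.queries;

-- ... and so on, up to max(id)
Note that VACUUM cannot run inside a transaction block anyway, so it has to be issued in autocommit mode.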