How to backfill column in Postgres without locking the table?


I recently added a new column to my 40+ million row Postgres table (v9.6) with

ALTER TABLE queries
   ADD COLUMN ml_partition_source UUID;

Then in another transaction I executed

ALTER TABLE queries
   ALTER COLUMN ml_partition_source
      SET DEFAULT public.gen_random_uuid();

I did this in two transactions, because adding a column with a default in a single statement makes Postgres 9.6 rewrite the whole table, which can take hours and isn't acceptable in production.

Now I'd like to backfill this column for all of the queries that existed before the new column was added, without locking the table. One way to do this would be through a CRUD API I have, but some rough calculations show that this would take ~22 days (maybe my API performance can be improved, but that's a whole different question). Instead, I tried writing a Postgres function:

CREATE OR REPLACE FUNCTION backfill_partition_source()
  RETURNS void AS $$
DECLARE
  query_ record;
BEGIN
  -- walk over every row that still needs a value
  FOR query_ IN
    SELECT * FROM api_mldata.queries WHERE ml_partition_source IS NULL
  LOOP
    -- assign a fresh UUID to that single row
    UPDATE api_mldata.queries
    SET ml_partition_source = public.gen_random_uuid()
    WHERE id = query_.id;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

and executed that with select backfill_partition_source();, but that ended up locking the table too.

How can I backfill a column without impacting production (or with minimal production impact)?

EDIT: one idea I have is "chunking" the script to operate on, say, 100k rows at a time and then executing it in a loop. The select statement would become

select * from api_mldata.queries
where ml_partition_source is null
limit 100000;
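
Concretely, what I have in mind for a single pass is something along these lines (just a rough sketch on my part; it would be re-run until it reports UPDATE 0, so each pass commits in its own transaction):

UPDATE api_mldata.queries
SET ml_partition_source = public.gen_random_uuid()
WHERE id IN (
    SELECT id FROM api_mldata.queries
    WHERE ml_partition_source IS NULL
    LIMIT 100000
);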

Solution

  • You cannot get away without locking at all, but you can keep the locks reasonably short. Your function held every row lock until it finished, because the whole function body runs inside a single transaction.

    Rather than running many single-row updates in a loop, run bigger updates:

    UPDATE api_mldata.queries
    SET ml_partition_source = DEFAULT
    WHERE id BETWEEN 1 AND 999999;
    

    Here id is the primary key of the table.

    That way you can be done with a few bigger updates, each for a different range of ids.

    To avoid bloat and excessive locking, run each statement in its own transaction and launch an explicit VACUUM on the table between statements, roughly as sketched below.
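
    A minimal sketch of that pattern, assuming autocommit is on (so each statement is its own transaction) and the ids run roughly from 1 up to 40-odd million; the extra IS NULL condition is an addition here so that rows which already received the default are left alone:

    UPDATE api_mldata.queries
    SET ml_partition_source = DEFAULT
    WHERE id BETWEEN 1 AND 999999
      AND ml_partition_source IS NULL;

    VACUUM api_mldata.queries;

    UPDATE api_mldata.queries
    SET ml_partition_source = DEFAULT
    WHERE id BETWEEN 1000000 AND 1999999
      AND ml_partition_source IS NULL;

    VACUUM api_mldata.queries;

    -- ...and so on, up through the highest id

    Note that VACUUM cannot run inside a transaction block, which is another reason each batch has to be issued as a separate top-level statement.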