Tags: postgresql, amazon-rds, relational-database, amazon-aurora, aws-aurora-serverless

How can I update a large table slowly?


I am using PostgreSQL 15 hosted on AWS Aurora Serverless v2. I have a table that includes 200 million rows, and I need to update the value of one column for all of those rows. The table's primary key id is a randomly generated UUID. It doesn't have any other index.

When I run update Table set column=value on this table, it consumes all of the database's CPU/memory, which impacts production traffic. Since the table doesn't have a distribution key, I can't update the rows page by page.

I am looking for a way to update it slowly. I don't mind if it takes a few days or a few weeks.


Solution

  • Do it in small batches

    create table already_updated as select id from your_table limit 0;
    create unique index on already_updated(id) with (fillfactor=100);
    

    This processes 10k rows at a time, working through the table in chunks in the order rows are laid out on disk, according to the hidden system column ctid:

    with selected_batch as(
       select ctid 
       from your_table
       where id not in (table already_updated)
       order by ctid 
       limit 1e4)
    ,updated_batch as(
       update your_table
       set col1='new_value'
       from selected_batch 
       where selected_batch.ctid=your_table.ctid
       returning your_table.id)
    insert into already_updated table updated_batch;
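
    To drive those batches without babysitting them, here's a minimal sketch of a loop, assuming PostgreSQL 11+ (which allows transaction control inside do blocks) and reusing the placeholder names your_table/col1 from above. Run it outside an explicit transaction block; it commits after every batch and sleeps, so locks are released and CPU headroom is left for production:

    do $$
    declare
      rows_done bigint;
    begin
      loop
        with selected_batch as(
           select ctid
           from your_table
           where id not in (table already_updated)
           order by ctid
           limit 1e4)
        ,updated_batch as(
           update your_table
           set col1='new_value'
           from selected_batch
           where selected_batch.ctid=your_table.ctid
           returning your_table.id)
        insert into already_updated table updated_batch;
        get diagnostics rows_done = row_count;  --rows inserted by this batch
        exit when rows_done = 0;                --nothing left to process
        commit;               --needs pg11+; keeps each transaction short
        perform pg_sleep(1);  --arbitrary throttle, tune to your load
      end loop;
    end $$;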
    

    If it's really a static value, the same for all 200M records, you don't need the separate progress-tracking table or a stable id to track progress by - the absence of 'new_value' itself tells you a row still needs the update:

    with cte as(select ctid 
                from your_table
                where col1 is distinct from 'new_value' --unlike <>, also matches nulls
                limit 1e4)
    update your_table
    set col1='new_value'
    from cte
    where cte.ctid=your_table.ctid;
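
    If you drive that from psql, one low-effort scheduler is psql's \watch (available since psql 9.3): run the statement above once, then have psql re-execute it every few seconds, and cancel it once it starts reporting UPDATE 0:

    \watch 10

    Each batch leaves dead tuples behind, so check that autovacuum keeps up, or run a manual vacuum on your_table between rounds.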
    

    The same applies if the value isn't identical for all rows, but you can still tell a not-yet-updated row from an already-processed one in a where clause. If you're doing this on a regular basis to update a field that's calculated from other fields of the same table, consider making it a generated column:

    alter table your_table 
       drop column if exists col1
     , add column col1 numeric generated always as (col2*col3) stored;
    

    That'll update on its own whenever either of the other two columns changes.
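
    A quick sketch of that behavior (generated columns need PostgreSQL 12+; gen_demo is a made-up name mirroring the columns in the statement above):

    create table gen_demo(col2 numeric, col3 numeric
      , col1 numeric generated always as (col2*col3) stored);
    insert into gen_demo(col2,col3) values(2,10);
    update gen_demo set col3=20;
    select col1 from gen_demo; --returns 40, recomputed by the update itself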


    You can also consider hiding the table behind a view if you need an additional column with a static value that's always the same for all rows, or if relatively few need to diverge from it:
    demo at db<>fiddle

    create table test(id,x)as select generate_series(1,5e5),(random()*1e7)::int;
    alter table test add primary key(id);
    
    create table default_value_for_test(id,y)as values(0,'default value');
    alter table default_value_for_test 
       add primary key(id)
      ,add constraint only1row check(id=0);
    create table diverged_values_for_test(id,y)as values(1,'val1'),(4,'val4');
    alter table diverged_values_for_test 
       add primary key(id)
      ,add foreign key(id)references test(id)
        on update cascade on delete cascade;
    
    create view v_test as 
      select t.*, coalesce(diverged_.y,default_.y) as y
      from test as t(id,x) 
      left join diverged_values_for_test as diverged_(id,y)using(id)
      cross join default_value_for_test as default_(id,y);
    
    explain analyze verbose 
    update default_value_for_test set y='newer static value';
    
    Update on public.default_value_for_test (actual time=0.048..0.049 rows=0 loops=1)
      ->  Seq Scan on public.default_value_for_test (actual time=0.009..0.010 rows=1 loops=1)
    Execution Time: 0.081 ms

    It took under 0.1 ms to update the tiny single-row table that holds what you want to see as the additional, static column of the big one. Alternatively, you could keep the default value as a literal constant in the view definition, and whenever you need to change it, run create or replace view to swap it out.
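
    That variant could look like this sketch, reusing the demo tables above (v_test2 is a hypothetical name; the default lives only in the view text, so changing it is a catalog-only operation that touches none of the rows):

    create or replace view v_test2 as
      select t.*, coalesce(diverged_.y, 'default value') as y
      from test as t(id,x)
      left join diverged_values_for_test as diverged_(id,y) using(id);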

    You're also free to add exceptions and let the value diverge for some rows - if there's little traffic on the column, it'll be a long time before enough rows diverge to make it worth turning diverged_ into an actual additional column on your table with 200M records, with default_ becoming that column's default. Inserts and updates on diverged_ should also be far faster than updates on the main table.
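
    If you ever reach that point, the fold-back is a sketch along these lines (on PostgreSQL 11+, adding a column with a constant default is a metadata-only change, so only the few diverged rows cost an actual write):

    alter table test add column y text default 'default value';
    update test             --backfill only the diverged rows
    set y=d.y
    from diverged_values_for_test as d
    where d.id=test.id;

    After that, the view and both side tables can be dropped.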

    If you instead tried to save it in all 500k rows of the test table right now, it would obviously take more time and space:

    alter table test add column z text default 'another static value';
    explain analyze verbose 
    update test set z='another static value';
    
    Update on public.test (actual time=4894.635..4894.636 rows=0 loops=1)
      ->  Seq Scan on public.test (actual time=0.009..143.322 rows=500000 loops=1)
    Execution Time: 4894.668 ms

    The cost of having your table split in three and hidden behind a view is paid on reads: the three tables need to be joined back together to get the full picture, but the price of cross joining the single cell to every row is negligible, and the cost of the left join against diverged_ depends on how many rows have diverged.


    For batch updates, you can also declare a cursor and use update ... where current of your_cursor, as sketched below.
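
    A sketch of that pattern in PL/pgSQL, again with the placeholder names your_table/col1 - with the caveat that you can't commit inside a loop over a bound cursor, so the whole run is one long transaction, which makes it less gentle than the committed batches above:

    do $$
    declare
      cur cursor for select id from your_table;
      r record;
      n bigint := 0;
    begin
      open cur;
      loop
        fetch cur into r;
        exit when not found;       --cursor exhausted
        update your_table
        set col1='new_value'
        where current of cur;      --updates the row the cursor sits on
        n := n + 1;
        if n % 10000 = 0 then
          perform pg_sleep(1);     --pause every 10k rows
        end if;
      end loop;
      close cur;
    end $$;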