I am using PostgreSQL 15 hosted on AWS Aurora Serverless v2. I have a table with 200 million rows, and I need to update the value of one column for all of them. The table's primary key id is a randomly generated UUID, and it has no other index.
When I run update Table set column = value
on this table, it consumes all of the database's CPU/memory, which impacts production traffic. Since the table doesn't have any distribution key, I can't update the rows page by page.
I am looking for a way to update it slowly. I don't mind if it takes a few days or a few weeks.
Do it in small batches
create table already_updated as select id from your_table limit 0;
create unique index on already_updated(id) with (fillfactor = 100);
This processes 10k rows at a time, working through the table in chunks in the order they are laid out on disk, according to the hidden system column ctid:
with selected_batch as (
    select ctid
    from your_table
    where id not in (table already_updated)
    order by ctid
    limit 1e4
), updated_batch as (
    update your_table
    set col1 = 'new_value'
    from selected_batch
    where selected_batch.ctid = your_table.ctid
    returning your_table.id
)
insert into already_updated table updated_batch;
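To run those batches unattended and slow them down deliberately, you could wrap that statement in a procedure that commits between batches. A sketch reusing the tables above (the backfill name and the 1-second sleep are my assumptions; procedures can commit between iterations when called outside an explicit transaction block):

```
create or replace procedure backfill()
language plpgsql as $$
declare
    rows_done bigint;
begin
    loop
        with selected_batch as (
            select ctid
            from your_table
            where id not in (table already_updated)
            order by ctid
            limit 1e4
        ), updated_batch as (
            update your_table
            set col1 = 'new_value'
            from selected_batch
            where selected_batch.ctid = your_table.ctid
            returning your_table.id
        )
        insert into already_updated table updated_batch;

        get diagnostics rows_done = row_count;
        exit when rows_done = 0;  -- nothing left to update

        commit;               -- keep transactions short, release locks
        perform pg_sleep(1);  -- arbitrary throttle to spare production traffic
    end loop;
end $$;

call backfill();
```

Raising the sleep (or lowering the batch size) trades total runtime for less pressure on the instance, which fits the "a few days or weeks is fine" constraint.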
If that's really a static value, the same for all 200M records, you don't need the separate progress-tracking table or the id bookkeeping - the absence of 'new_value' itself marks a row as still needing the update:
with cte as (
    select ctid
    from your_table
    where col1 is distinct from 'new_value'  -- unlike <>, this also catches nulls
    limit 1e4
)
update your_table
set col1 = 'new_value'
from cte
where cte.ctid = your_table.ctid;
The same approach works if the value isn't identical for all rows but you can still tell a not-yet-updated row from an already-processed one in a where clause. If you're doing this on a regular basis to update a field that's calculated from other fields of the same table, consider making it a generated column:
alter table your_table
drop column if exists col1
, add column col1 numeric generated always as (col2*col3) stored;
That'll update on its own whenever either of the other two columns changes.
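A quick self-contained illustration (the demo table name and values are mine, not from the question):

```
create table demo(
    col2 numeric,
    col3 numeric,
    col1 numeric generated always as (col2 * col3) stored);

insert into demo(col2, col3) values (2, 10);
update demo set col3 = 20;
select col1 from demo;  -- 40: recomputed automatically on the update
```

You never write to col1 directly - attempting to would raise an error - so the value can never drift out of sync with its inputs.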
You can also consider hiding the table behind a view if you need an additional column with a static value that's always the same for all rows, or if relatively few rows need to diverge from it:
demo at db<>fiddle
create table test(id, x) as
    select generate_series(1, 5e5), (random() * 1e7)::int;
alter table test add primary key(id);

create table default_value_for_test(id, y) as values(0, 'default value');
alter table default_value_for_test
    add primary key(id),
    add constraint only1row check(id = 0);

create table diverged_values_for_test(id, y) as values(1, 'val1'), (4, 'val4');
alter table diverged_values_for_test
    add primary key(id),
    add foreign key(id) references test(id)
        on update cascade on delete cascade;

create view v_test as
    select t.*, coalesce(diverged_.y, default_.y) as y
    from test as t(id, x)
    left join diverged_values_for_test as diverged_(id, y) using(id)
    cross join default_value_for_test as default_(id, y);
explain analyze verbose
update default_value_for_test set y='newer static value';
Update on public.default_value_for_test (actual time=0.048..0.049 rows=0 loops=1)
  -> Seq Scan on public.default_value_for_test (actual time=0.009..0.010 rows=1 loops=1)
Execution Time: 0.081 ms
It took under 0.1 ms to update the tiny single-row table that holds what you want to see as the additional, static column of the big one. Alternatively, you could keep the default value as a literal constant in the view definition and, whenever you need to change it, run create or replace view to update it.
You're also free to add exceptions and let the value diverge for some rows. If there's little traffic on the column, it'll take a long time before it's worth instead making diverged_values_for_test an actual additional column on your table with 200M records, with default_value_for_test becoming an actual default on that column. Inserts and updates on diverged_values_for_test should also be way faster than updates on the main table.
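Continuing the demo schema above, diverging one more row is a single cheap insert into the side table (row 7 and its value are my example, not from the original):

```
insert into diverged_values_for_test(id, y) values (7, 'val7');

select * from v_test where id in (1, 7);
-- id 7 now reads 'val7'; all rows without a diverged entry
-- still show the single shared default
```

The on delete cascade foreign key declared earlier keeps the side table from accumulating orphaned exceptions if rows are removed from test.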
If you instead tried to save the value in all 500k rows of the test table right now, it would obviously take more time and space:
alter table test add column z text default 'another static value';
explain analyze verbose
update test set z='another static value';
Update on public.test (actual time=4894.635..4894.636 rows=0 loops=1)
  -> Seq Scan on public.test (actual time=0.009..143.322 rows=500000 loops=1)
Execution Time: 4894.668 ms
The cost of having your table split in three and moved behind a view is incurred on reads: the three tables obviously need to be joined together to get the full picture, but the price of cross joining the single-cell table to every row is negligible, and the cost of the left join against diverged_values_for_test depends on how many rows have diverged.
For batch updates, you can also declare a cursor and use update ... where current of your_cursor.
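A minimal session-level sketch of that pattern, with table and column names assumed from the earlier examples (a client script would loop the fetch/update pair rather than repeat it by hand):

```
begin;
-- the cursor must be simply updatable for "where current of" to work
declare cur cursor for select id from your_table for update;

fetch next from cur;  -- positions the cursor on the first row
update your_table set col1 = 'new_value' where current of cur;

fetch next from cur;  -- advance and repeat until fetch returns no row
update your_table set col1 = 'new_value' where current of cur;

close cur;
commit;
```

One caveat: for update cursors can't be declared with hold, so they don't survive a commit; to split the work across many small transactions you'd have to commit, re-declare the cursor, and skip past the rows already done, which is why the ctid-based batching above is usually the more practical option here.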