We are working on a live Postgres database on Heroku. We need to update around 5 million rows, using regex to replace multiple strings in each, which means performing potentially around 100 million updates in total.
We are updating it this way (using psycopg2):
for element in list:
    cursor.execute("Update table set text = regexp_replace(text, %s, 'NewWord', 'gi') where date >= '2017-12-31';", [element])
The database is live and linked to our Django website, and we need to roll out a new feature within 3 days that requires this update to the database. The Postgres guides say updates are much faster if we remove indexes, but removing the foreign-key indexes might break some of our Django functionality and take the website down. Even so, we are able to take the site down for one to two days over the weekend, but nothing more.
So:
Your query (for simplicity, I removed the parametrisation and replaced {table, date, text} with {ztable, zdate, ztext}, because they are keywords):
Update ztable
set ztext = regexp_replace(ztext, 'Oldword', 'NewWord', 'gi')
where zdate >= '2017-12-31';
can be sped up a lot by adding an extra condition to the WHERE clause, like in:
Update ztable
set ztext = regexp_replace(ztext, 'Oldword', 'NewWord', 'gi')
where zdate >= '2017-12-31'
AND ztext ILIKE '%Oldword%'
;
This avoids creating extra row versions when the update would not actually change a row. (An update costs 1 read I/O, plus roughly 3 more I/Os if the row is actually changed; the number of affected disk blocks depends on how sparsely the matching rows are distributed and on the row size.) Note the ILIKE: the filter must be case-insensitive to match the 'gi' flag of regexp_replace, otherwise rows whose only occurrences differ in case would be skipped.
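For instance, a quick count with the same placeholder names shows how many rows the extra condition actually leaves for the UPDATE to touch (a sketch, not part of the original query):

SELECT count(*)
FROM ztable
WHERE zdate >= '2017-12-31'
AND ztext ILIKE '%Oldword%';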
Extra note: if there is an index on the ztext column, throw it away; it is probably useless.
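For example (the index name here is hypothetical; the real name can be looked up in pg_indexes):

DROP INDEX IF EXISTS ztable_ztext_idx;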
Extra performance can probably be gained by removing the loop in the front end and putting all the logic into a single UPDATE; a sketch follows.
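A sketch of what that single statement could look like, assuming every old word maps to the same 'NewWord' (as in the psycopg2 loop in the question) and that the old words contain no regex metacharacters; the word list Oldword1|Oldword2|Oldword3 is purely illustrative:

Update ztable
set ztext = regexp_replace(ztext, 'Oldword1|Oldword2|Oldword3', 'NewWord', 'gi')
where zdate >= '2017-12-31'
AND ztext ~* 'Oldword1|Oldword2|Oldword3'
;

This rewrites each affected row once for all the words, instead of once per word, so the table is scanned a single time and far fewer row versions are created.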