Search code examples
djangopostgresqlheroku-postgres

Removing indexes before Update on a live Database


We are working on a live Postgres Database on Heroku. We need to update around 5 million rows using regex replacing multiple strings, which will mean performing potentially around 100 million updates total.

We are updating it this way: (using psycopg2)

for element in list:
    cursor.execute("Update table set text = regexp_replace(text, %s, 'NewWord', 'gi') where date >= '2017-12-31';", [element])

The database is live and linked to our Django website, and we need to roll a new feature within 3 days requiring this update to the database. Postgres Guides say that it's much faster if we remove indexes, but removing foreign key indexes might stop some of our django functionality and take the website down. Even so, we are able to take the side down for one to two days during the weekend, but nothing more.

So :

  1. Is it safe to assume that by removing the Indexes, 100 million updates could be done in a day?
  2. if so, Should we also remove Primary Key Indexes?
  3. If not, how much time would we assume a similar update will take without removing the Indexes?

Solution

    1. 100 million updates can be done within an hour. (if the row size is not too large)
    2. No. The primary key is not affected by updating the text field, so you should leave it alone
    3. The time needed for an update is approximately the same with or without an index (if there is no index on the affected text field)

    your query (for simplicity, I removed the parametrisation, and replaced {table,date,text} by {ztable,zdate,ztext}, because it are keywords):

    Update ztable
    set ztext = regexp_replace(ztext, 'Oldword', 'NewWord', 'gi')
    where zdate >= '2017-12-31';
    

    Can be speeded up a lot by adding an extra condition to the where clause, like in:

    Update ztable
    set ztext = regexp_replace(ztext, 'Oldword', 'NewWord', 'gi')
    where zdate >= '2017-12-31'
    AND ztext LIKE '%Oldword%'
    ;
    

    This will avoid extra row-versions to be created if the update actually does nothing to a row. (an update costs 1 read-I/O + approx 3 I/Os if the row is actually changed, the number of affected disk blocks depends on the sparsety and the row size)

    Extra note: if there is an index on the ztext column: throw it away; it is probably useless. Maybe extra performance can be gained by removing the loop in the front-end, and putting all logic in the UPDATE.