I want to update a column in a large table (tens of millions of rows), without bringing our application down; i.e. there will be concurrent writes to the table. I want to do this in code (Java) since the update is non-trivial, and am wondering what the best way to do it is.
An efficient approach is to open a read-only transaction that queries all rows, loop over the results (resultSet.next()), and in parallel create a series of read/write transactions, each buffering, say, 10,000 INSERT_OR_UPDATE mutations containing the desired updates.
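A minimal sketch of that approach with the Cloud Spanner Java client (SOME_TABLE, its Key and Value columns, and the computeNewValue helper are hypothetical placeholders for the real schema and update logic):

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.ReadOnlyTransaction;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;
import java.util.ArrayList;
import java.util.List;

void updateAllRows(DatabaseClient dbClient) {
  List<Mutation> batch = new ArrayList<>();
  // Stream every row through a single read-only transaction.
  try (ReadOnlyTransaction tx = dbClient.readOnlyTransaction();
       ResultSet rs = tx.executeQuery(Statement.of("SELECT Key, Value FROM SOME_TABLE"))) {
    while (rs.next()) {
      batch.add(Mutation.newInsertOrUpdateBuilder("SOME_TABLE")
          .set("Key").to(rs.getString("Key"))
          .set("Value").to(computeNewValue(rs.getString("Value")))
          .build());
      if (batch.size() == 10_000) {
        dbClient.write(batch); // commits one read/write transaction per batch
        batch.clear();
      }
    }
  }
  if (!batch.isEmpty()) {
    dbClient.write(batch);
  }
}

static String computeNewValue(String oldValue) {
  return oldValue; // placeholder for the non-trivial update logic
}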
The problem is that this doesn't handle concurrent writes, since these steps can happen:

1. The read transaction reads a row and computes its new value.
2. A concurrent transaction updates the same row and commits.
3. The buffered INSERT_OR_UPDATE mutation commits, silently overwriting the concurrent write with a value computed from stale data.
To fix this, I can read the value back during the read/write transaction and verify that it hasn't changed, similar to this example here. However, this seems very slow (~50 ms per call, which translates to several weeks to update the whole table).
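Roughly what I mean, as a sketch (key, originalValue and newValue stand in for the values captured during the initial read); the extra round trip per row is what makes it so slow:

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Key;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.Struct;
import java.util.Collections;

boolean updateIfUnchanged(DatabaseClient dbClient, String key,
    String originalValue, String newValue) {
  return Boolean.TRUE.equals(dbClient.readWriteTransaction().run(transaction -> {
    // Re-read the row and only apply the update if it is still unchanged.
    Struct row = transaction.readRow("SOME_TABLE", Key.of(key),
        Collections.singletonList("Value"));
    if (row != null && originalValue.equals(row.getString("Value"))) {
      transaction.buffer(Mutation.newUpdateBuilder("SOME_TABLE")
          .set("Key").to(key)
          .set("Value").to(newValue)
          .build());
      return true;
    }
    return false; // a concurrent write changed the row; recompute instead
  }));
}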
So how can I do this more efficiently? Thanks!
The best way to do this is not to use a read-only transaction at all, but to start a read/write transaction for each batch of 10,000 records: read the values you want to update in that read/write transaction, then update those 10,000 records in the same transaction. Repeat this until all records have been updated.
So something like this:
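A sketch of that loop, assuming a hypothetical Updated flag column that marks finished rows (without some such filter the query would return the same 10,000 rows on every iteration); SOME_TABLE, the column names and computeNewValue are placeholders as before:

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;
import java.util.ArrayList;
import java.util.List;

void updateInBatches(DatabaseClient dbClient) {
  boolean more = true;
  while (more) {
    more = dbClient.readWriteTransaction().run(transaction -> {
      List<Mutation> mutations = new ArrayList<>();
      try (ResultSet rs = transaction.executeQuery(Statement.of(
          "SELECT Key, Value FROM SOME_TABLE WHERE Updated = FALSE LIMIT 10000"))) {
        while (rs.next()) {
          mutations.add(Mutation.newUpdateBuilder("SOME_TABLE")
              .set("Key").to(rs.getString("Key"))
              .set("Value").to(computeNewValue(rs.getString("Value")))
              .set("Updated").to(true)
              .build());
        }
      }
      transaction.buffer(mutations);
      return !mutations.isEmpty(); // false once no rows are left
    });
  }
}

Because the read and the write happen in the same read/write transaction, a concurrent update to one of these rows makes the transaction abort and retry instead of being silently overwritten.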
It is also more efficient to select each batch by primary key range instead of paging with LIMIT and OFFSET, so you get a query like this: SELECT * FROM SOME_TABLE WHERE KEY>=@start AND KEY<@end LIMIT 10000
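With the Java client the bounds go in as query parameters, roughly like this (currentStart and currentEnd are hypothetical variables tracking the key range of the current batch):

import com.google.cloud.spanner.Statement;

Statement stmt = Statement.newBuilder(
        "SELECT * FROM SOME_TABLE WHERE KEY>=@start AND KEY<@end LIMIT 10000")
    .bind("start").to(currentStart)
    .bind("end").to(currentEnd)
    .build();

After each batch, advance @start past the last key processed so the next query continues where the previous one left off.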