google-cloud-spanner

Updating column in large Spanner table with concurrent writes


I want to update a column in a large table (tens of millions of rows), without bringing our application down; i.e. there will be concurrent writes to the table. I want to do this in code (Java) since the update is non-trivial, and am wondering what the best way to do it is.

An efficient approach is to open a read-only transaction that queries all rows, then loop over the results (resultSet.next()) while concurrently committing a series of read/write transactions, each buffering, say, 10,000 INSERT_OR_UPDATE mutations containing the desired updates.
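For concreteness, roughly like this with the Java client. The table Singers, its key SingerId, the column Name and transform(...) are all made-up placeholders for this sketch:

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;
import java.util.ArrayList;
import java.util.List;

public class NaiveBulkUpdate {

  // Streams every row through one read-only snapshot, flushing 10,000
  // buffered mutations at a time; each flush commits as its own transaction.
  static void updateAll(DatabaseClient client) {
    List<Mutation> batch = new ArrayList<>();
    try (ResultSet rs = client.singleUse()
        .executeQuery(Statement.of("SELECT SingerId, Name FROM Singers"))) {
      while (rs.next()) {
        batch.add(Mutation.newInsertOrUpdateBuilder("Singers")
            .set("SingerId").to(rs.getLong("SingerId"))
            .set("Name").to(transform(rs.getString("Name")))
            .build());
        if (batch.size() == 10_000) {
          client.write(batch); // one commit per batch of mutations
          batch.clear();
        }
      }
    }
    if (!batch.isEmpty()) {
      client.write(batch);
    }
  }

  // Placeholder for the non-trivial update logic.
  static String transform(String name) {
    return name.trim();
  }
}
```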

The problem is that this doesn't handle concurrent writes, since these steps can happen:

  1. The read transaction mentioned above reads row X
  2. Some separate transaction updates row X
  3. The read/write transaction mentioned above uses data from step 1 and overwrites the update from step 2

To fix this, I can read the value back during the read/write transaction and verify that it hasn't changed, similar to this example here; however, this seems very slow (~50 ms per call, which translates to several weeks to update the whole table).
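Per row, that check would look something like the sketch below (same placeholder schema as above); the extra read round trip for every single row is what makes it slow:

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Key;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.Struct;
import java.util.Collections;

public class ReadBackCheck {

  // Re-reads the row inside a read/write transaction and only applies the
  // update if the value still matches what the read-only transaction saw.
  static void updateIfUnchanged(
      DatabaseClient client, long id, String expected, String updated) {
    client.readWriteTransaction().run(tx -> {
      Struct row =
          tx.readRow("Singers", Key.of(id), Collections.singletonList("Name"));
      if (row != null && expected.equals(row.getString("Name"))) {
        tx.buffer(Mutation.newUpdateBuilder("Singers")
            .set("SingerId").to(id)
            .set("Name").to(updated)
            .build());
      }
      return null;
    });
  }
}
```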

So how can I do this more efficiently? Thanks!


Solution

  • The best way to do this is not to use a read-only transaction at all, but to start a read/write transaction for each batch of 10,000 records: read the values you want to update inside that read/write transaction, then write the updates back in the same transaction. Because the reads happen inside the read/write transaction, Spanner's locking ensures that a conflicting concurrent write aborts the transaction (which the client library then retries) instead of having its update silently overwritten. Repeat this until all records have been updated (a sketch follows the steps below).

    So something like this:

    1. Start read/write transaction
    2. Read a batch of 10,000 records. Make sure that the order of the records is consistent by ordering on the primary key or some other unique (combination of) column(s). Limit the results by combining a key-range condition with LIMIT, so you get a query like this: SELECT * FROM SOME_TABLE WHERE KEY >= @start AND KEY < @end ORDER BY KEY LIMIT 10000
    3. Update the records and commit the transaction.
    4. Repeat until all records have been updated (see the sketch after this list).
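A minimal sketch of that loop with the Java client, again with placeholder table and column names and a numeric primary key; as a small variant, it uses the last key of the previous batch as the lower bound instead of a fixed @end. Each iteration reads and updates one batch inside a single read/write transaction, so a conflicting concurrent write causes that batch to abort and retry rather than be lost:

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;
import java.util.ArrayList;
import java.util.List;

public class BatchedReadWriteUpdate {

  static void updateAll(DatabaseClient client) {
    long start = Long.MIN_VALUE; // assumes all keys are > Long.MIN_VALUE
    while (true) {
      final long batchStart = start;
      // Read and update one batch of 10,000 rows inside a single
      // read/write transaction. On a retry the query is simply re-run.
      Long lastKey = client.readWriteTransaction().run(tx -> {
        Statement stmt = Statement.newBuilder(
                "SELECT SingerId, Name FROM Singers"
                    + " WHERE SingerId > @start ORDER BY SingerId LIMIT 10000")
            .bind("start").to(batchStart)
            .build();
        List<Mutation> mutations = new ArrayList<>();
        Long last = null;
        try (ResultSet rs = tx.executeQuery(stmt)) {
          while (rs.next()) {
            last = rs.getLong("SingerId");
            mutations.add(Mutation.newUpdateBuilder("Singers")
                .set("SingerId").to(last)
                .set("Name").to(transform(rs.getString("Name")))
                .build());
          }
        }
        tx.buffer(mutations); // applied atomically when the transaction commits
        return last; // null when no rows are left
      });
      if (lastKey == null) {
        break;
      }
      start = lastKey;
    }
  }

  // Placeholder for the non-trivial update logic.
  static String transform(String name) {
    return name.trim();
  }
}
```

Keeping the batch at 10,000 rows keeps each commit well within Spanner's per-transaction mutation limits while amortizing the transaction overhead that made the per-row read-back check so slow.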