google-cloud-spanner

Unable to delete large number of rows from Spanner


I have a 3-node Spanner instance, and a single table that contains around 4 billion rows. The DDL looks like this:

CREATE TABLE predictions (
    name STRING(MAX),
    ...,
    model_version INT64,
) PRIMARY KEY (name, model_version)

I'd like to set up a job to periodically remove some old rows from this table using the Python Spanner client. The query I'd like to run is:

DELETE FROM predictions WHERE model_version <> ? 

According to the docs, it sounds like I would need to execute this as a Partitioned DML statement. I am using the Python Spanner client as follows, but am experiencing timeouts (504 Deadline Exceeded errors) due to the large number of rows in my table.

from google.cloud.spanner_v1 import param_types

# this always throws a "504 Deadline Exceeded" error
database.execute_partitioned_dml(
    "DELETE FROM predictions WHERE model_version <> @model_version",
    params={"model_version": 104},
    param_types={"model_version": param_types.INT64},
)

My first intuition was to see if there was some sort of timeout I could increase, but I don't see any timeout parameters in the source :/

I did notice there was a run_in_transaction method in the Spanner lib that takes a timeout parameter, so I decided to deviate from the partitioned DML approach to see if using this method worked. Here's what I ran:

def delete_old_rows(transaction, model_version):
    delete_dml = "DELETE FROM predictions WHERE model_version <> {}".format(model_version)
    dml_statements = [
        delete_dml,
    ]
    status, row_counts = transaction.batch_update(dml_statements)


database.run_in_transaction(
    delete_old_rows,
    model_version=104,
    timeout_secs=3600,
)

What's weird about this is that the timeout_secs parameter appears to be ignored: I still get a 504 Deadline Exceeded error within a minute or two of executing the above code, despite the one-hour timeout.

Anyway, I'm not sure what to try next, or whether I'm missing something obvious that would let me run a delete query in a timely fashion on this huge Spanner table. The model_version column has pretty low cardinality (generally 2-3 unique model_version values in the entire table), so I'm not sure whether that factors into any recommendations. But if someone could offer some advice or suggestions, that would be awesome :) Thanks in advance


Solution

  • The first suggestion is to try gcloud instead (a command-line sketch follows this list):

    https://cloud.google.com/spanner/docs/modify-gcloud#modifying_data_using_dml

  • Another suggestion is to pass a range of name as well, to limit the number of rows scanned. For example, you could add something like STARTS_WITH(name, 'a') to the WHERE clause so that each transaction touches a small number of rows; first, though, you will need to know the domain of the name column's values (a Python sketch follows this list).

  • The last suggestion is to try to avoid using '<>' if possible, as it is generally pretty expensive to evaluate.
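For the gcloud route, partitioned DML can be run straight from the command line via gcloud spanner databases execute-sql with the --enable-partitioned-dml flag. A minimal sketch; the instance name (test-instance) and database name (example-db) are placeholders, not values from the question:

gcloud spanner databases execute-sql example-db \
    --instance=test-instance \
    --enable-partitioned-dml \
    --sql="DELETE FROM predictions WHERE model_version <> 104"

And here is a Python sketch that combines the second and third suggestions: chunk the delete by name prefix, and replace '<>' with equality on versions you already know are stale. It assumes names start with a lowercase ASCII letter and that the old model versions (102 and 103 here) are known up front; both assumptions are illustrative, not part of the question:

import string

from google.cloud import spanner
from google.cloud.spanner_v1 import param_types

spanner_client = spanner.Client()
database = spanner_client.instance("test-instance").database("example-db")

# Hypothetical list of stale versions; equality predicates replace the
# more expensive '<>' comparison.
OLD_VERSIONS = [102, 103]

for version in OLD_VERSIONS:
    for prefix in string.ascii_lowercase:  # assumes names start with a-z
        # Each partitioned DML statement only scans rows under one name
        # prefix, so no single statement has to touch all 4B rows at once.
        row_count = database.execute_partitioned_dml(
            "DELETE FROM predictions "
            "WHERE STARTS_WITH(name, @prefix) AND model_version = @version",
            params={"prefix": prefix, "version": version},
            param_types={
                "prefix": param_types.STRING,
                "version": param_types.INT64,
            },
        )
        print("deleted {} rows for prefix {}".format(row_count, prefix))

Partitioned DML statements must be idempotent, and a DELETE with a predicate like this is, so if any single chunk still times out it can simply be retried.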