Search code examples
google-cloud-spanner

Ruby API: Expiring Cloud Spanner Rows


I have a 100s of GB-sized Spanner Table with two string and one DateTime column. I'm trying to expire older than X-days rows using a DELETE statement.

So far this didn't work out using the Ruby API, I'm getting Transaction Timeout errors or Google::Cloud::UnavailableError when I'm using:

spanner.execute_partition_update("DELETE FROM table WHERE datetime < '#{(Date.today - 300).strftime("%Y-%m-%dT%H:%M:%SZ")}'")

Unfortunately I don't have a Date column and I assume that the table is not partitioned. Is there a way to execute such a large DELETE statement at the moment, or can I fix this table in the long run by altering the schema to include a Date column?


Solution

    1. I suspect that the DATETIME column is of type TIMESTAMP, per Spanner’s allowable types.

    2. When running a partitioned DML statement, the table doesn’t have to be partitioned first since this is an automatic process done when executing the command. Here is more info about the process https://cloud.google.com/spanner/docs/dml-partitioned#dml_and_partitioned_dml

    3. I would recommend using the “gcloud” command in this case to make sure the Ruby client library is not a contributing factor to the issue

    4. Regarding the query, it is best to use the built-in Spanner functions from the official documentation to make sure it is optimized. In this case, using the CURRENT_TIMESTAMP() function along with the TIMESTAMP_SUB() function will make sure the query is correctly handled by Spanner when issuing a TIMESTAMP value subtracted of X number of days.

    In this use case, the command to execute would be:

    gcloud spanner databases execute-sql [db-name] \
        --instance=[instance-name] --enable-partitioned-dml \
        --sql=”DELETE FROM table WHERE datetime < TIMESTAMP_SUB(CURRENT_TIMESTAMP, INTERVAL 700 day)“
    

    In order to reduce the number of rows read, STARTS_WITH() could be used against one of the STRING columns in the query, if there is a correlation with the TIMESTAMP column.