I have a Spanner table that is hundreds of GB in size, with two STRING columns and one DateTime column. I'm trying to expire rows older than X days using a DELETE statement.
So far this hasn't worked with the Ruby API; I get transaction timeout errors or Google::Cloud::UnavailableError when I run:
spanner.execute_partition_update("DELETE FROM table WHERE datetime < '#{(Date.today - 300).strftime("%Y-%m-%dT%H:%M:%SZ")}'")
Unfortunately I don't have a Date column, and I assume the table is not partitioned. Is there a way to execute such a large DELETE statement right now, or can I fix this table in the long run by altering the schema to include a Date column?
I suspect that the DATETIME column is of type TIMESTAMP, per Spanner’s allowable types.
When running a partitioned DML statement, the table doesn't have to be partitioned first: Spanner partitions the key space automatically when the statement executes. There is more about the process at https://cloud.google.com/spanner/docs/dml-partitioned#dml_and_partitioned_dml
I would recommend running the statement with the gcloud CLI in this case, to rule out the Ruby client library as a contributing factor.
Regarding the query, it is best to use Spanner's built-in functions (see the official documentation) so the statement stays well optimized. Here, combining CURRENT_TIMESTAMP() with TIMESTAMP_SUB() lets Spanner itself compute the cutoff TIMESTAMP X days in the past, instead of interpolating a formatted date string from the client.
In this use case, the command to execute would be:
gcloud spanner databases execute-sql [db-name] \
  --instance=[instance-name] --enable-partitioned-dml \
  --sql="DELETE FROM table WHERE datetime < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 700 DAY)"
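If the Ruby client is not the problem, the same corrected statement can be issued from Ruby as well. Here is a minimal sketch; `expire_sql` is a helper introduced for illustration, and the project, instance, and database names in the comment are placeholders:

```ruby
# Build the partitioned DML statement. Letting Spanner compute the cutoff
# via TIMESTAMP_SUB(CURRENT_TIMESTAMP(), ...) avoids the timezone and
# formatting pitfalls of interpolating a Ruby-formatted date string.
def expire_sql(days)
  "DELETE FROM table WHERE datetime < " \
  "TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL #{Integer(days)} DAY)"
end

# With the google-cloud-spanner gem (placeholder names):
#   spanner = Google::Cloud::Spanner.new project: "my-project"
#   client  = spanner.client "my-instance", "my-database"
#   deleted = client.execute_partition_update expire_sql(300)
puts expire_sql(300)
```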
To reduce the number of rows read, STARTS_WITH() could be applied to one of the STRING columns in the query, if its values correlate with the TIMESTAMP column.
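As a sketch of that idea, suppose (hypothetically) that one STRING column, here called `string_col`, begins with a date-like prefix that correlates with row age; the statement could then be narrowed like this:

```ruby
# Hypothetical example: string_col and the "2019-" prefix are assumptions
# about the schema, not part of the original question. The extra predicate
# lets each partition skip rows whose prefix cannot match.
prefix = "2019-"
sql = "DELETE FROM table " \
      "WHERE STARTS_WITH(string_col, '#{prefix}') " \
      "AND datetime < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 300 DAY)"
puts sql
```

Whether this actually reduces work depends on the table's primary key: the prefix filter only prunes reads if Spanner can use it against an indexed or key column.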