I'm copying Spanner data to BigQuery through a Dataflow job. The job is scheduled to run every 15 minutes. The problem is, if the data is read from a Spanner table which is also being written at the same time, some of the records get missed while copying to BigQuery.
I'm using readOnlyTransaction() while reading Spanner data. Is there any other precaution that I must take while doing this activity?
It is recommended to use Cloud Spanner commit timestamps to populate columns like `update_date`. Commit timestamps allow applications to determine the exact ordering of mutations.
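As a sketch of what that looks like (the table and column names here are hypothetical), you enable the commit-timestamp option on the column and write `PENDING_COMMIT_TIMESTAMP()` on every insert/update, so Spanner fills in the transaction's commit time atomically:

```sql
-- Allow the column to hold Spanner commit timestamps
ALTER TABLE Orders
  ALTER COLUMN update_date SET OPTIONS (allow_commit_timestamp = true);

-- On every write, let Spanner stamp the commit time
UPDATE Orders
SET update_date = PENDING_COMMIT_TIMESTAMP()
WHERE order_id = @order_id;
```

Note that a column with `allow_commit_timestamp = true` can only be written with `PENDING_COMMIT_TIMESTAMP()` or an explicit timestamp in the past, so existing application writes to that column may need adjusting.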
By using commit timestamps for `update_date` and reading at an exact timestamp, the Dataflow job can pick up every record written and committed since the previous run: the exact-timestamp read gives a consistent snapshot, and filtering on `update_date` between the previous run's read timestamp and the current one bounds the incremental window.
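A minimal sketch of the Dataflow side, assuming Beam's `SpannerIO` connector; the instance/database IDs, table name, and the mechanism for persisting `lastRunTs` between runs (shown here as a hard-coded value) are all placeholders you would replace:

```java
import com.google.cloud.Timestamp;
import com.google.cloud.spanner.TimestampBound;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SpannerIncrementalRead {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The previous run's read timestamp; in practice you would persist this
    // (e.g. in GCS or a metadata table) rather than hard-code it.
    Timestamp lastRunTs = Timestamp.parseTimestamp("2020-01-01T00:00:00Z");
    // Snapshot timestamp for this run; becomes the next run's lastRunTs.
    Timestamp snapshotTs = Timestamp.now();

    p.apply(SpannerIO.read()
        .withInstanceId("my-instance")   // assumed instance id
        .withDatabaseId("my-database")   // assumed database id
        // Hypothetical table; only rows committed since the last run.
        .withQuery("SELECT * FROM Orders WHERE update_date > '" + lastRunTs
            + "' AND update_date <= '" + snapshotTs + "'")
        // Exact-timestamp read: a consistent snapshot as of snapshotTs,
        // so concurrent writes can no longer cause missed records.
        .withTimestampBound(TimestampBound.ofReadTimestamp(snapshotTs)));

    p.run().waitUntilFinish();
  }
}
```

The key point is pairing the `update_date` filter with `TimestampBound.ofReadTimestamp(...)`: the filter alone is not enough, because a strong read taken mid-run could see writes committed after the window you recorded.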