I stream change stream data from Spanner into BigQuery. I use the default `--template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Spanner_Change_Streams_to_BigQuery` flex template.

The `Spanner_Change_Streams_to_BigQuery` template uses a Splittable DoFn, so I cannot drain the job. I am not sure how to handle this, or whether it is even the correct approach. Is there a simpler way of preventing any data loss when I have to propagate schema changes?
I considered using `const result = await dataflow.projects.jobs.getMetrics(request);` to get all watermarks, but I ended up with multiple watermarks, one for each stage of the template.

Is this the right approach? If so, how do I start a new flex template job that has multiple stages, based on multiple watermarks?
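For reference, this is roughly how I query the metrics (a minimal sketch using the Node.js `googleapis` client; the project and job IDs are placeholders, not my real values). Filtering for `DataWatermark` yields the per-stage entries shown below:

```javascript
const { google } = require('googleapis');

async function listDataWatermarks(projectId, jobId) {
  // Authenticate with application default credentials.
  const auth = await google.auth.getClient({
    scopes: ['https://www.googleapis.com/auth/cloud-platform'],
  });
  const dataflow = google.dataflow({ version: 'v1b3', auth });

  const result = await dataflow.projects.jobs.getMetrics({ projectId, jobId });

  // Each execution stage (F174, F175, ...) reports its own DataWatermark,
  // which is why several entries come back for a single job.
  return (result.data.metrics || []).filter(
    (m) => m.name && m.name.name === 'DataWatermark'
  );
}

// Hypothetical IDs, purely for illustration.
listDataWatermarks('my-project', '2023-10-17_00_00_00-1234567890').then(
  (watermarks) => watermarks.forEach((w) => console.log(w))
);
```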
```
DataWatermark F174 1697517770000000
{
  name: {
    origin: 'dataflow/v1b3',
    name: 'DataWatermark',
    context: { execution_step: 'F174' }
  },
  scalar: 1697517770000000,
  updateTime: '2023-10-17T04:43:06.711Z'
}
DataWatermark F174 1697517770000000
{
  name: {
    origin: 'dataflow/v1b3',
    name: 'DataWatermark',
    context: { execution_step: 'F175', tentative: 'true' }
  },
  scalar: 1697517770000000,
  updateTime: '2023-10-17T04:43:06.711Z'
}
DataWatermark F175 1697517770000000
{
  name: {
    origin: 'dataflow/v1b3',
    name: 'DataWatermark',
    context: { execution_step: 'F175' }
  },
  scalar: 1697517770000000,
  updateTime: '2023-10-17T04:43:06.711Z'
}
DataWatermark F175 1697517770000000
```
Turns out, my setup was too complex to properly understand what was happening under the hood.

Key insight: Spanner schema changes do not automatically propagate into BigQuery, but if the changes to the BigQuery schema are made before making the changes to Spanner, there will be no data loss.
Order of operations:

1. Run the `add column` statement in BigQuery against the dataset/table that the Dataflow pipeline writes to (the pipeline is running with `use_runner_v2` and Streaming Engine enabled):

   ```
   # I use these parameters in my `gcloud dataflow flex-template run` command
   ...
   experiments=use_runner_v2,\
   enableStreamingEngine=true,\
   ...
   ```

2. Apply the `add column` changes to Spanner.
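To make the ordering concrete, here is a minimal sketch of both steps using the Node.js client libraries. The project, instance, database, dataset, table and column names are hypothetical; the only thing that matters is that the BigQuery DDL completes before the Spanner DDL runs.

```javascript
const { BigQuery } = require('@google-cloud/bigquery');
const { Spanner } = require('@google-cloud/spanner');

async function propagateSchemaChange() {
  // Step 1: add the column to the BigQuery table the pipeline writes to.
  const bigquery = new BigQuery();
  await bigquery.query(
    'ALTER TABLE `my_dataset.my_table` ADD COLUMN IF NOT EXISTS new_col STRING'
  );

  // Step 2: only after BigQuery is updated, add the same column to the
  // Spanner table that the change stream watches.
  const spanner = new Spanner({ projectId: 'my-project' });
  const database = spanner.instance('my-instance').database('my-database');
  const [operation] = await database.updateSchema(
    'ALTER TABLE MyTable ADD COLUMN new_col STRING(MAX)'
  );
  await operation.promise(); // wait for the Spanner DDL to finish
}

propagateSchemaChange().catch(console.error);
```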