Tags: google-cloud-platform, google-bigquery, google-cloud-dataflow, apache-beam, google-cloud-spanner

How to prevent GCP Dataflow data loss during restart of a job (flex template SpannerToBigQuery)


I stream change stream data from Spanner into BigQuery. I use the default --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Spanner_Change_Streams_to_BigQuery flex template.
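For context, a pipeline like this is normally launched with gcloud dataflow flex-template run (as in the Solution below) or programmatically through the Dataflow REST API. Below is a minimal sketch using the googleapis Node.js client with Application Default Credentials; the project, region, job name, and the template parameter names and values are assumptions and should be verified against the template's documentation.

// Sketch: launch the Spanner_Change_Streams_to_BigQuery flex template through
// the Dataflow REST API (an alternative to `gcloud dataflow flex-template run`).
// Assumes the googleapis Node.js client and Application Default Credentials.
const {google} = require('googleapis');

async function launchChangeStreamPipeline() {
  const auth = new google.auth.GoogleAuth({
    scopes: ['https://www.googleapis.com/auth/cloud-platform'],
  });
  const dataflow = google.dataflow({version: 'v1b3', auth});

  const res = await dataflow.projects.locations.flexTemplates.launch({
    projectId: 'PROJECT_ID',          // placeholder
    location: 'us-central1',
    requestBody: {
      launchParameter: {
        jobName: 'spanner-cdc-to-bq', // placeholder
        containerSpecGcsPath:
          'gs://dataflow-templates-us-central1/latest/flex/Spanner_Change_Streams_to_BigQuery',
        // Parameter names below are assumptions; verify against the template docs.
        parameters: {
          spannerInstanceId: 'my-instance',
          spannerDatabase: 'my-database',
          spannerMetadataInstanceId: 'my-instance',
          spannerMetadataDatabase: 'my-metadata-db',
          spannerChangeStreamName: 'my_change_stream',
          bigQueryDataset: 'my_dataset',
        },
      },
    },
  });

  console.log('Launched job:', res.data.job.id);
}

launchChangeStreamPipeline().catch(console.error);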

Situation

  1. My Spanner schema changes sometimes
  2. When my Spanner schema changes, I manually make the changes to the BigQuery schema
  3. Then I have to cancel my Dataflow pipeline. Because the Spanner_Change_Streams_to_BigQuery template uses a Splittable DoFn, I cannot drain
  4. Now I incur data loss between the shutdown of my current pipeline and the start of the new one

Possible solution

  • Use watermark metrics to start a new pipeline

I'm not sure how, or whether this is even the correct approach. Is there a simpler way of preventing data loss when I have to propagate schema changes?


I considered using const result = await dataflow.projects.jobs.getMetrics(request); to get all watermarks, but I ended up with multiple watermarks, one for each stage of the template.

Is this the right approach? If so, how do I start a new flex template that has multiple stages, based on multiple watermarks?
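For reference, a minimal sketch of the metrics call, assuming the googleapis Node.js client with Application Default Credentials (project ID, job ID, and region are placeholders). The DataWatermark entries it returns, one per execution stage, look like the dump below:

const {google} = require('googleapis');

async function listDataWatermarks(projectId, jobId, location) {
  const auth = new google.auth.GoogleAuth({
    scopes: ['https://www.googleapis.com/auth/cloud-platform'],
  });
  const dataflow = google.dataflow({version: 'v1b3', auth});

  // Returns every metric Dataflow tracks for the job, one entry per stage.
  const result = await dataflow.projects.jobs.getMetrics({
    projectId,
    jobId,
    location,
  });

  // Keep only the per-stage data watermarks.
  const watermarks = (result.data.metrics || []).filter(
    (m) => m.name && m.name.name === 'DataWatermark'
  );

  for (const w of watermarks) {
    console.log(w.name.name, w.name.context.execution_step, w.scalar);
    console.log(w);
  }
  return watermarks;
}

listDataWatermarks('PROJECT_ID', 'JOB_ID', 'us-central1').catch(console.error);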

DataWatermark F174 1697517770000000
{
  name: {
    origin: 'dataflow/v1b3',
    name: 'DataWatermark',
    context: { execution_step: 'F174' }
  },
  scalar: 1697517770000000,
  updateTime: '2023-10-17T04:43:06.711Z'
}
DataWatermark F174 1697517770000000
{
  name: {
    origin: 'dataflow/v1b3',
    name: 'DataWatermark',
    context: { execution_step: 'F175', tentative: 'true' }
  },
  scalar: 1697517770000000,
  updateTime: '2023-10-17T04:43:06.711Z'
}
DataWatermark F175 1697517770000000
{
  name: {
    origin: 'dataflow/v1b3',
    name: 'DataWatermark',
    context: { execution_step: 'F175' }
  },
  scalar: 1697517770000000,
  updateTime: '2023-10-17T04:43:06.711Z'
}
DataWatermark F175 1697517770000000

Solution

  • Turns out, my setup was too complex to properly understand what was happening under the hood.

    Key insight: Spanner schema changes do not automatically propagate into BigQuery, but if the changes to the BigQuery schema are made before making the changes to Spanner, there will be no data loss.

    Order of operations (a code sketch of the schema changes follows the list):

    1. Run the ADD COLUMN statement in BigQuery on the dataset/table that the Dataflow pipeline writes to
    2. No changes to the Dataflow pipeline are needed (make sure it runs with use_runner_v2 and Streaming Engine enabled)
    # I use these parameters in my `gcloud dataflow flex-template run` command
    ...
    experiments=use_runner_v2,\
    enableStreamingEngine=true,\
    ...
    
    3. Now make the ADD COLUMN change in Spanner
    4. Deploy the code that writes into the new Spanner columns
    5. Data will automatically flow from Spanner to BigQuery, without any intervention or restart of the Dataflow pipeline needed
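
    To make the ordering concrete, here is a minimal sketch using the @google-cloud/bigquery and @google-cloud/spanner Node.js clients. The dataset, table, and column names are placeholders, and the changelog table name depends on how the template is configured; the sketch applies the BigQuery change (step 1) before the Spanner change (step 3):

    const {BigQuery} = require('@google-cloud/bigquery');
    const {Spanner} = require('@google-cloud/spanner');

    async function addColumnInSafeOrder() {
      // Step 1: add the column to the BigQuery changelog table first, so the
      // running pipeline can write the new field as soon as it shows up in the
      // change stream (placeholder dataset/table/column names).
      const bigquery = new BigQuery();
      await bigquery.query(
        'ALTER TABLE `my_dataset.my_table_changelog` ADD COLUMN IF NOT EXISTS new_col STRING'
      );

      // Step 3: only then add the same column in Spanner (placeholder names).
      const spanner = new Spanner();
      const database = spanner.instance('my-instance').database('my-database');
      const [operation] = await database.updateSchema([
        'ALTER TABLE my_table ADD COLUMN new_col STRING(MAX)',
      ]);
      await operation.promise(); // wait for the Spanner schema change to finish
    }

    addColumnInSafeOrder().catch(console.error);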