
Apache Beam with Dataflow: flag 'ignore_unknown_columns' for WriteToBigQuery not working


I am building a streaming pipeline using Apache Beam (Python SDK version 2.37.0) and Google Cloud Dataflow to write data I receive via Pub/Sub to BigQuery.

I process the data and end up with rows represented as dictionaries like this:

{'val1': 17.4, 'val2': 40.8, 'timestamp': 1650456507, 'NA_VAL': 'table_name'}

I then want to use WriteToBigQuery to insert the values into my table.

However, my table only has the columns val1, val2, and timestamp, so NA_VAL should be ignored. As I understand the docs, this should be possible by setting ignore_unknown_columns=True.

However, when I run the pipeline in Dataflow, I still receive an error and no values are inserted into the table:

There were errors inserting to BigQuery. Will not retry. Errors were [{'index': 0, 'errors': [{'reason': 'invalid', 'location': 'NA_VAL', 'debugInfo': '', 'message': 'no such field: NA_VAL.'}]}]

I tried a simple job configuration like this:

rows | beam.io.WriteToBigQuery(
            table='PROJECT:DATASET.TABLE',
            ignore_unknown_columns=True)

as well as one with these parameters:

rows | beam.io.WriteToBigQuery(
            table='PROJECT:DATASET.TABLE',
            ignore_unknown_columns=True,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            method='STREAMING_INSERTS',
            insert_retry_strategy='RETRY_NEVER')

Question: Am I missing something that is preventing the pipeline from working? Has anyone run into the same issue and/or found a solution?


Solution

Unfortunately, you have been bitten by a bug. It was reported as https://issues.apache.org/jira/browse/BEAM-14039 and fixed by https://github.com/apache/beam/pull/16999. Version 2.38.0 will include the fix; verification for that release just concluded today, so it should be available quite soon.
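
Until the fixed release is available, a possible workaround is to strip the unknown keys yourself before the write. This is a minimal sketch, not part of the original answer; it assumes the target table's columns are known ahead of time (column names taken from the question):

import apache_beam as beam

# Columns that actually exist in the target table (from the question).
KNOWN_COLUMNS = {'val1', 'val2', 'timestamp'}

def drop_unknown_columns(row):
    # Keep only keys that match a table column, e.g. drop 'NA_VAL'.
    return {k: v for k, v in row.items() if k in KNOWN_COLUMNS}

(rows
 | 'DropUnknownColumns' >> beam.Map(drop_unknown_columns)
 | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
       table='PROJECT:DATASET.TABLE',
       create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
       method='STREAMING_INSERTS'))

Once 2.38.0 is out, upgrading the SDK (e.g. pip install -U 'apache-beam[gcp]==2.38.0') should make ignore_unknown_columns behave as documented, and the filtering step can be dropped.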