I am building a streaming pipeline using Apache Beam (Python SDK version 2.37.0) and Google Dataflow to write data I receive via Pub/Sub to BigQuery.
I process the data and end up with rows represented by a dictionary like this:
{'val1': 17.4, 'val2': 40.8, 'timestamp': 1650456507, 'NA_VAL': 'table_name'}
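For context, these rows come out of a parsing step roughly like this (the subscription name, transform labels, and message format here are illustrative, not my actual code):

import json
import apache_beam as beam

rows = (
    pipeline
    | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
    | 'ParseJson' >> beam.Map(json.loads))  # produces dicts like the one above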
I then want to use WriteToBigQuery to insert the values into my table. However, my table only has the columns val1, val2, and timestamp, so NA_VAL should be ignored. From my understanding of the docs, this should be possible by setting ignore_unknown_columns=True.
However, when running the pipeline in Dataflow, I still receive an error and no values are inserted into the table:
There were errors inserting to BigQuery. Will not retry. Errors were [{'index': 0, 'errors': [{'reason': 'invalid', 'location': 'NA_VAL', 'debugInfo': '', 'message': 'no such field: NA_VAL.'}]}]
I tried a simple job configuration like this:
rows | beam.io.WriteToBigQuery(
    table='PROJECT:DATASET.TABLE',
    ignore_unknown_columns=True)
as well as with these parameters:
rows | beam.io.WriteToBigQuery(
    table='PROJECT:DATASET.TABLE',
    ignore_unknown_columns=True,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    method='STREAMING_INSERTS',
    insert_retry_strategy='RETRY_NEVER')
Question: Am I missing something that prevents the pipeline from working? Has anyone run into the same issue or found a solution?
Unfortunately you have been bitten by a bug. This was reported as https://issues.apache.org/jira/browse/BEAM-14039 and fixed by https://github.com/apache/beam/pull/16999. Version 2.38.0 will include this fix. Verification for that release just concluded today, so it should be available quite soon.
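Until 2.38.0 is available, one possible interim workaround (just a sketch, reusing the column names and write settings from your question) is to drop the unknown key yourself before the write transform:

known_fields = {'val1', 'val2', 'timestamp'}

(rows
    # Keep only the columns that exist in the target table so the
    # streaming insert succeeds on 2.37.0 despite the bug.
    | 'DropUnknownColumns' >> beam.Map(
        lambda row: {k: v for k, v in row.items() if k in known_fields})
    | beam.io.WriteToBigQuery(
        table='PROJECT:DATASET.TABLE',
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        method='STREAMING_INSERTS',
        insert_retry_strategy='RETRY_NEVER'))

Once you upgrade to 2.38.0 you can remove the extra Map and rely on ignore_unknown_columns=True as documented.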