Tags: scala, google-bigquery, google-hadoop

Google BigQuery Spark Connector: How to ignore unknown values on append


We use the Google BigQuery Spark Connector to import data stored in Parquet files into BigQuery. Using custom tooling, we generate the schema file BigQuery needs and reference it in our import code (Scala).

However, our data doesn't strictly adhere to a fixed, well-defined schema; in some cases, additional columns may appear in individual datasets. For that reason, when experimenting with BigQuery via the command-line tool bq, we almost always passed --ignore_unknown_values, since otherwise many imports would fail.
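For reference, a typical bq invocation with that flag looks like the following sketch (the dataset, table, bucket path, and schema file are placeholders, and newline-delimited JSON is assumed as the staging format, since --ignore_unknown_values applies to JSON and CSV loads):

```shell
# Load newline-delimited JSON into BigQuery, silently dropping any fields
# that are not present in schema.json instead of failing the whole import.
# mydataset.mytable, the gs:// path, and schema.json are placeholders.
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  mydataset.mytable \
  gs://my-bucket/staging/*.json \
  ./schema.json
```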

Unfortunately, we could not find an equivalent configuration option in the BigQuery Spark Connector com.google.cloud.bigdataoss:bigquery-connector:0.10.1-hadoop2. Does it exist?


Solution

  • This unfortunately isn't currently plumbed through the connector, and even if we added it now, the official release would take several weeks to get deployed everywhere. I filed an issue to track this feature request in the GitHub repository.

    In the meantime, if you want to build your own version of the connector, you can edit the JobConfigurationLoad settings explicitly, either in BigQueryRecordWriter if you're using the older "direct output format", or BigQueryHelper if you're using the newer "indirect output format", and add a line like:

    loadConfig.setIgnoreUnknownValues(true);
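
    In context, the patched load configuration inside the connector might look roughly like the fragment below. This is a sketch, not verbatim connector source: the surrounding setters and variable names are assumptions, and only the setIgnoreUnknownValues(true) call is the actual change being suggested.

    ```java
    // Hypothetical surrounding code from the connector's load-job setup;
    // JobConfigurationLoad is the BigQuery API client's load-job config class.
    import com.google.api.services.bigquery.model.JobConfigurationLoad;

    JobConfigurationLoad loadConfig = new JobConfigurationLoad();
    loadConfig.setSchema(schema);                    // schema built from the schema file (assumed variable)
    loadConfig.setSourceFormat("NEWLINE_DELIMITED_JSON");
    loadConfig.setIgnoreUnknownValues(true);         // the added line: skip unknown columns instead of failing
    ```

    After rebuilding the connector with this change, load jobs it submits should behave like bq load with --ignore_unknown_values: rows with extra columns are accepted and the unknown columns are discarded.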