Loading Avro-file to BigQuery fails with an internal error


On March 23, 2016, Google BigQuery announced "Added support for Avro source format for load operations and as a federated data source in the BigQuery API or command-line tool". It also says: "This is a Beta release of Avro format support. This feature is not covered by any SLA or deprecation policy and may be subject to backward-incompatible changes." Still, I'd expect the feature to work.

I couldn't find any code examples of how to use the Avro format for loading, nor did I find examples of how to load Avro files with the bq tool.

Here's my practical issue: I haven't been able to load data into BigQuery in Avro format.

The following happens using the bq tool. The dataset, table name, and bucket name have been obfuscated:

    $ bq extract --destination_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
    Waiting on bqjob_r62088699049ce969_0000015432b7627a_1 ... (36s) Current status: DONE

    $ bq load --source_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
    Waiting on bqjob_r6cefe75ece6073a1_0000015432b83516_1 ... (2s) Current status: DONE
    BigQuery error in load operation: Error processing job 'dataset:bqjob_r6cefe75ece6073a1_0000015432b83516_1': An internal error occurred and the request could not be completed.

Basically, I am extracting from a table and loading back into the same table, and that causes an internal error.

Additionally, I have a Java program that does the same thing (extract from table X and load into table X) with the same result (internal error). I think the bq transcript above illustrates the problem as clearly as possible, so I'm not sharing that code here. In Java, if I extract from an empty table and load the result, the load job doesn't fail.
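
For reference, the Java round trip looks roughly like the following. This is only a minimal sketch using the google-cloud-bigquery client library (not necessarily the library or version my actual program uses), with the obfuscated dataset, table, and bucket names from above:

    import com.google.cloud.bigquery.*;

    public class AvroRoundTrip {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("dataset", "events_avro_test");
        String uri = "gs://BUCKET/events_bq_tool.avro";

        // Extract the table to GCS in Avro format.
        ExtractJobConfiguration extract =
            ExtractJobConfiguration.newBuilder(table, uri).setFormat("AVRO").build();
        bigquery.create(JobInfo.of(extract)).waitFor();

        // Load the same file back into the same table.
        LoadJobConfiguration load = LoadJobConfiguration.newBuilder(table, uri)
            .setFormatOptions(FormatOptions.avro())
            .build();
        Job loadJob = bigquery.create(JobInfo.of(load)).waitFor();

        // With my table, the completed load job reports the internal error described above.
        System.out.println(loadJob.getStatus().getError());
      }
    }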

My questions are:

  • I think the BigQuery API should never fail with an internal error. Why is that happening in my test?
  • Is the extracted Avro file compatible with an insert job?
  • There seems to be no specification of what the Avro schema in an insert job should look like; at least I couldn't find any. Could that documentation be created?

UPDATED 2016-04-25:

So far, based on the hint about not using REQUIRED fields, I've managed to get an Avro load job to complete without an internal error. However, I haven't managed to load non-null values.

Consider this Avro-schema:

{ "type": "record", "name": "root", "fields": [ { "name": "x", "type": "string" } ] }

The BigQuery table has one column, x, which is NULLABLE.

If I insert N rows (I've tried with one and two, x being e.g. 1), I get N rows in BigQuery, but x always has the value null.
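
For completeness, a file matching this schema can be produced roughly as follows; this is a minimal sketch using the Apache Avro Java library, where the output file name and the value "1" are just examples, not my exact code:

    import java.io.File;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class WriteTestAvro {
      public static void main(String[] args) throws IOException {
        // The same schema as above: a record "root" with a single string field "x".
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"root\", \"fields\": "
                + "[{\"name\": \"x\", \"type\": \"string\"}]}");

        GenericRecord row = new GenericData.Record(schema);
        row.put("x", "1");

        // Write one row to an Avro data file.
        try (DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
          writer.create(schema, new File("events_test.avro"));
          writer.append(row);
        }
      }
    }

Loading a file like that with bq load --source_format=AVRO, as above, is the step that yields rows where x is null.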

If I change the table so that x is REQUIRED, I get an internal error.


Solution

  • There is no exact match from a BQ schema to an Avro schema, and vice versa, so when you export a BQ table to an Avro file and then import it back, the schema will be different. I see that the destination table of your load already exists; in this case we throw an error when the schema of the destination table doesn't match the schema we converted from the Avro schema. This should be an external error, though; we're investigating why it's reported as an internal error.

    We're in the middle of upgrading the export pipeline, and the new import pipeline has a bug that makes it fail on Avro files exported by the current pipeline. The fix should be deployed in a couple of weeks. After that, if you import the exported file into a non-existent destination table, or a destination table with a compatible schema, it should work. Meanwhile, importing your own Avro files should work. You can also query them directly on GCS without importing them.
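
    As an illustration, a federated query over the file on GCS could look roughly like this with the google-cloud-bigquery Java client; the external table name and the one-column schema are assumptions based on the example above:

        import com.google.cloud.bigquery.*;

        public class QueryAvroOnGcs {
          public static void main(String[] args) throws InterruptedException {
            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

            // Define an external (federated) table backed by the Avro file on GCS.
            Schema schema = Schema.of(Field.of("x", LegacySQLTypeName.STRING));
            ExternalTableDefinition definition = ExternalTableDefinition.of(
                "gs://BUCKET/events_bq_tool.avro", schema, FormatOptions.avro());
            bigquery.create(TableInfo.of(TableId.of("dataset", "events_avro_external"), definition));

            // Query the data in place, without loading it into BigQuery.
            QueryJobConfiguration query =
                QueryJobConfiguration.of("SELECT x FROM dataset.events_avro_external");
            for (FieldValueList row : bigquery.query(query).iterateAll()) {
              System.out.println(row.get("x").getStringValue());
            }
          }
        }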