json, google-bigquery, google-cloud-storage, google-cloud-functions, google-api-client

Google Cloud Functions: loading GCS JSON files into BigQuery with non-standard keys


I have a Google Cloud Storage bucket where a legacy system drops newline-delimited JSON (NEWLINE_DELIMITED_JSON) files that need to be loaded into BigQuery.

I wrote a Google Cloud Function that takes the JSON file and loads it into BigQuery. The function works fine with sample JSON files - the problem is that the legacy system generates JSON with a non-standard key:

{
  "id": 12345,
  "@address": "XXXXXX"
  ...
}

Of course the "@address" key throws everything off and the cloud function errors out ...
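
For reference, my current function is basically a load job with schema autodetection, roughly like this (simplified sketch; the table name is a placeholder):

from google.cloud import bigquery


def load_json(data, context):
    """Background Cloud Function triggered when a new file lands in the bucket."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.autodetect = True
    uri = 'gs://{}/{}'.format(data['bucket'], data['name'])
    load_job = client.load_table_from_uri(uri, 'my_dataset.my_table', job_config=job_config)
    load_job.result()  # raises on load errors, e.g. the invalid "@address" field name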

Is there any option to "ignore" the JSON fields that have non-standard keys? Or to provide a mapping and ignore any JSON field that is not in the map? I looked around to see if I could disable the schema autodetection and provide my own mapping, but the online documentation does not cover this situation.

I am contemplating the option of:

  • Loading the file in memory into a string variable
  • Replacing @address with address
  • Converting the newline-delimited JSON into a list of dictionaries
  • Using BigQuery streaming inserts to insert the rows into BQ (rough sketch below)
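
Roughly along these lines (untested sketch; bucket and table names are placeholders, and real code would have to batch the streaming inserts):

import json

from google.cloud import bigquery, storage


def load_json_cleaned(data, context):
    """Read the file, rename the '@' keys, and stream the rows into BigQuery."""
    blob = storage.Client().bucket(data['bucket']).blob(data['name'])
    text = blob.download_as_string().decode('utf-8')  # whole file in memory

    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        rows.append({key.lstrip('@'): value for key, value in record.items()})

    client = bigquery.Client()
    table = client.get_table('my_dataset.my_table')  # placeholder table
    errors = client.insert_rows_json(table, rows)    # streaming insert
    if errors:
        raise RuntimeError(errors)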

But I'm afraid this will take a lot longer, the file size may exceed the 2 GB limit for Cloud Functions, I'd have to deal with Unicode when loading the file into a variable, and so on.

What other options do I have?

And no, I cannot modify the legacy system to rename the "@address" field :(

Thanks!


Solution

  • I'm going to assume the error that you are getting is something like this:

    Errors: query: Invalid field name "@address". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.

    This is an error message on the BigQuery side, because cols/fields in BigQuery have naming restrictions. So, you're going to have to clean your file(s) before loading them into BigQuery.

    Here's one way of doing it, which is completely serverless:

    1. Create a Cloud Function to trigger on new files arriving in the bucket. You've already done this part by the sounds of things.
    2. Create a templated Cloud Dataflow pipeline that is triggered by the Cloud Function when a new file arrives. It simply passes the name of the file to process to the pipeline.
    3. In said Cloud Dataflow pipeline, read the JSON file into a ParDo, and using a JSON parsing library (e.g. Jackson if you are using Java), read the object and get rid of the "@" before creating your output TableRow object (see the sketch after this list).
    4. Write results to BigQuery. Under the hood, this will actually invoke a BigQuery load job.
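
    Here's a minimal sketch of step 3 using the Apache Beam Python SDK instead of Java/Jackson (my assumption; the table name and the inputFile parameter are placeholders):

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    class CleanJsonOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Runtime parameter so the pipeline can be staged as a template
            parser.add_value_provider_argument('--inputFile', type=str)


    def clean_keys(line):
        """Parse one newline-delimited JSON record and strip '@' from key names."""
        record = json.loads(line)
        return {key.lstrip('@'): value for key, value in record.items()}


    def run():
        options = PipelineOptions()  # --project, --runner=DataflowRunner, etc. come from the CLI
        with beam.Pipeline(options=options) as p:
            (
                p
                | 'ReadNDJSON' >> beam.io.ReadFromText(options.view_as(CleanJsonOptions).inputFile)
                | 'CleanKeys' >> beam.Map(clean_keys)
                | 'WriteToBQ' >> beam.io.WriteToBigQuery(
                    'my-project:my_dataset.my_table',  # placeholder table
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
            )


    if __name__ == '__main__':
        run()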

    To sum up, you'll need the following in the conga line:

    File > GCS > Cloud Function > Dataflow (template) > BigQuery
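
    For the Cloud Function > Dataflow (template) hop, the glue is a small function that calls the Dataflow templates launch API. A sketch, assuming a Python background function; the project ID, template path and the inputFile parameter name are placeholders that must match your template:

    from googleapiclient.discovery import build


    def trigger_dataflow(data, context):
        """Background Cloud Function: launch the templated pipeline for the new file."""
        dataflow = build('dataflow', 'v1b3', cache_discovery=False)
        body = {
            'jobName': 'clean-json-load',
            'parameters': {
                # Must match the ValueProvider parameter defined in the template
                'inputFile': 'gs://{}/{}'.format(data['bucket'], data['name']),
            },
        }
        dataflow.projects().templates().launch(
            projectId='my-project',                          # placeholder
            gcsPath='gs://my-bucket/templates/clean-json',   # placeholder template path
            body=body).execute()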
    

    The advantages of this:

    1. Event driven
    2. Scalable
    3. Serverless/no-ops
    4. You get monitoring and alerting out of the box with Stackdriver
    5. Minimal code

    See:

    Disclosure: the last link is to a blog post written by one of the engineers I work with.