json, google-bigquery, google-cloud-storage, google-cloud-functions, google-api-client

Google Cloud Functions: loading GCS JSON files into BigQuery with non-standard keys


I have a Google Cloud Storage bucket where a legacy system drops newline-delimited JSON (NEWLINE_DELIMITED_JSON) files that need to be loaded into BigQuery.

I wrote a Google Cloud Function that takes the JSON file and loads it into BigQuery. The function works fine with sample JSON files - the problem is that the legacy system generates JSON with a non-standard key:

{
  "id": 12345,
  "@address": "XXXXXX"
  ...
}

Of course the "@address" key throws everything off and the cloud function errors out ...
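
For reference, my current function is basically a load job with schema autodetection, roughly like this (simplified sketch; the table name is a placeholder):

from google.cloud import bigquery


def load_json(data, context):
    """Background Cloud Function triggered when a new file lands in the bucket."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.autodetect = True
    uri = 'gs://{}/{}'.format(data['bucket'], data['name'])
    load_job = client.load_table_from_uri(uri, 'my_dataset.my_table', job_config=job_config)
    load_job.result()  # raises on load errors, e.g. the invalid "@address" field name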

Is there any option to "ignore" the JSON fields that have non-standard keys? Or to provide a mapping and ignore any JSON field that is not in the map? I looked around to see if I could disable the schema autodetection and provide my own mapping, but the online documentation does not cover this situation.

I am contemplating the option of:

  • Loading the file in memory into a string variable
  • Replacing @address with address
  • Converting the newline-delimited JSON into a list of dictionaries
  • Using BigQuery streaming inserts to insert the rows into BQ (rough sketch below)
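
Roughly along these lines (untested sketch; bucket and table names are placeholders, and real code would have to batch the streaming inserts):

import json

from google.cloud import bigquery, storage


def load_json_cleaned(data, context):
    """Read the file, rename the '@' keys, and stream the rows into BigQuery."""
    blob = storage.Client().bucket(data['bucket']).blob(data['name'])
    text = blob.download_as_string().decode('utf-8')  # whole file in memory

    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        rows.append({key.lstrip('@'): value for key, value in record.items()})

    client = bigquery.Client()
    table = client.get_table('my_dataset.my_table')  # placeholder table
    errors = client.insert_rows_json(table, rows)    # streaming insert
    if errors:
        raise RuntimeError(errors)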

But I'm afraid this will take a lot longer, the file size may exceed the 2 GB limit for Cloud Functions, I'd have to deal with Unicode when loading the file into a variable, and so on.

What other options do I have?

And no, I cannot modify the legacy system to rename the "@address" field :(

Thanks!


Solution

  • I'm going to assume the error that you are getting is something like this:

    Errors: query: Invalid field name "@address". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.

    This is an error message on the BigQuery side, because cols/fields in BigQuery have naming restrictions. So, you're going to have to clean your file(s) before loading them into BigQuery.

    Here's one way of doing it, which is completely serverless:

    1. Create a Cloud Function to trigger on new files arriving in the bucket. You've already done this part by the sounds of things.
    2. Create a templated Cloud Dataflow pipeline that is triggered by the Cloud Function when a new file arrives. It simply passes the name of the file to process to the pipeline.
    3. In said Cloud Dataflow pipeline, read the JSON file into a ParDo, and using a JSON parsing library (e.g. Jackson if you are using Java), read the object and get rid of the "@" before creating your output TableRow object (see the sketch after this list).
    4. Write results to BigQuery. Under the hood, this will actually invoke a BigQuery load job.
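
    Here's a minimal sketch of step 3 using the Apache Beam Python SDK instead of Java/Jackson (my assumption; the table name and the inputFile parameter are placeholders):

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    class CleanJsonOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Runtime parameter so the pipeline can be staged as a template
            parser.add_value_provider_argument('--inputFile', type=str)


    def clean_keys(line):
        """Parse one newline-delimited JSON record and strip '@' from key names."""
        record = json.loads(line)
        return {key.lstrip('@'): value for key, value in record.items()}


    def run():
        options = PipelineOptions()  # --project, --runner=DataflowRunner, etc. come from the CLI
        with beam.Pipeline(options=options) as p:
            (
                p
                | 'ReadNDJSON' >> beam.io.ReadFromText(options.view_as(CleanJsonOptions).inputFile)
                | 'CleanKeys' >> beam.Map(clean_keys)
                | 'WriteToBQ' >> beam.io.WriteToBigQuery(
                    'my-project:my_dataset.my_table',  # placeholder table
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
            )


    if __name__ == '__main__':
        run()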

    To sum up, you'll need the following in the conga line:

    File > GCS > Cloud Function > Dataflow (template) > BigQuery
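
    For the Cloud Function > Dataflow (template) hop, the glue is a small function that calls the Dataflow templates launch API. A sketch, assuming a Python background function; the project ID, template path and the inputFile parameter name are placeholders that must match your template:

    from googleapiclient.discovery import build


    def trigger_dataflow(data, context):
        """Background Cloud Function: launch the templated pipeline for the new file."""
        dataflow = build('dataflow', 'v1b3', cache_discovery=False)
        body = {
            'jobName': 'clean-json-load',
            'parameters': {
                # Must match the ValueProvider parameter defined in the template
                'inputFile': 'gs://{}/{}'.format(data['bucket'], data['name']),
            },
        }
        dataflow.projects().templates().launch(
            projectId='my-project',                          # placeholder
            gcsPath='gs://my-bucket/templates/clean-json',   # placeholder template path
            body=body).execute()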
    

    The advantages of this:

    1. Event driven
    2. Scalable
    3. Serverless/no-ops
    4. You get monitoring and alerting out of the box with Stackdriver
    5. Minimal code

    See:

    Disclosure: the last link is to a blog post written by one of the engineers I work with.