I have a Google Cloud Storage bucket where a legacy system drops NEW_LINE_DELIMITED_JSON files that need to be loaded into BigQuery.
I wrote a Google Cloud Function that takes the JSON file and loads it into BigQuery. The function works fine with sample JSON files - the problem is that the legacy system generates JSON with a non-standard key:
{
"id": 12345,
"@address": "XXXXXX"
...
}
Of course the "@address" key throws everything off and the cloud function errors out ...
Is there any option to "ignore" the JSON fields that have non-standard keys? Or to provide a mapping and ignore any JSON field that is not in the map? I looked around to see if I could disable autodetect and provide my own mapping, but the online documentation does not cover this situation.
I am contemplating the option of reading the file into memory inside the function, renaming "@address" to a valid key, and then loading the cleaned data into BigQuery.
But I'm afraid this will take a lot longer, the file size may exceed the 2 GB max for functions, I'd have to deal with Unicode when loading the file into a variable, etc. etc. etc.
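For reference, the cleanup step I have in mind would look roughly like this (just a sketch using Jackson; the input/output file names are placeholders and I'd still have to wire it up to GCS and the actual BigQuery load):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CleanNdjson {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static void main(String[] args) throws Exception {
    // Read the raw NEW_LINE_DELIMITED_JSON dropped by the legacy system
    // and write a cleaned copy with "@address" renamed to "address".
    try (BufferedReader in = Files.newBufferedReader(Paths.get("input.json"), StandardCharsets.UTF_8);
         BufferedWriter out = Files.newBufferedWriter(Paths.get("cleaned.json"), StandardCharsets.UTF_8)) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.trim().isEmpty()) {
          continue; // skip blank lines
        }
        ObjectNode row = (ObjectNode) MAPPER.readTree(line);
        if (row.has("@address")) {
          row.set("address", row.remove("@address")); // rename the offending key
        }
        out.write(MAPPER.writeValueAsString(row));
        out.newLine();
      }
    }
  }
}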
What other options do I have?
And no, I cannot modify the legacy system to rename the "@address" field :(
Thanks!
I'm going to assume the error that you are getting is something like this:
Errors: query: Invalid field name "@address". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
This error comes from the BigQuery side, because column/field names in BigQuery have naming restrictions. So, you're going to have to clean your file(s) before loading them into BigQuery.
Here's one way of doing it, which is completely serverless:

1. Trigger a Cloud Function when the file lands in the GCS bucket.
2. From the Cloud Function, kick off a Dataflow job built from a template.
3. In the Dataflow pipeline, use a ParDo, and using a JSON parsing library (e.g. Jackson if you are using Java), read the object and get rid of the "@" before creating your output TableRow object (a sketch of this is below).
4. Write the cleaned rows to BigQuery.

To sum up, you'll need the following in the conga line:
File > GCS > Cloud Function > Dataflow (template) > BigQuery
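To give you an idea of step 3, the ParDo could be something like the following. This is only a sketch using the Beam Java SDK and Jackson: the class name is made up, it assumes flat, one-object-per-line JSON, and it naively maps values to strings or numbers, so you'd adapt the type handling to your real schema.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;

import java.util.Iterator;
import java.util.Map;

/**
 * Reads one line of newline-delimited JSON, drops the "@" prefix from any
 * key that has one, and emits a BigQuery TableRow.
 */
public class CleanJsonToTableRowFn extends DoFn<String, TableRow> {

  // Static so it is not serialized with the DoFn.
  private static final ObjectMapper MAPPER = new ObjectMapper();

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    JsonNode json = MAPPER.readTree(c.element());
    TableRow row = new TableRow();

    Iterator<Map.Entry<String, JsonNode>> fields = json.fields();
    while (fields.hasNext()) {
      Map.Entry<String, JsonNode> field = fields.next();
      // "@address" -> "address"; keys without "@" pass through unchanged.
      String name = field.getKey().startsWith("@")
          ? field.getKey().substring(1)
          : field.getKey();
      row.set(name, field.getValue().isNumber()
          ? field.getValue().numberValue()
          : field.getValue().asText());
    }
    c.output(row);
  }
}

In the pipeline itself, this DoFn would sit between TextIO.read() on the GCS file and BigQueryIO.writeTableRows(), with the table schema supplied explicitly rather than relying on autodetect.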
The advantages of this:

- It's event driven: the pipeline kicks off as soon as the file lands in GCS.
- It's completely serverless: there's no infrastructure for you to manage.
- Dataflow scales out, so file size isn't the constraint it would be inside a Cloud Function.
See:
disclosure: the last link is to a blog which was written by one of the engineers I work with.