I am trying to store messages sent from an IoT device in a BigQuery table.
The cloud architecture is as follows:
Local Device -> json_message -> mqtt_client -> GC IoT device -> Device Registry -> Pub/Sub Topic -> Dataflow with Pub/Sub Topic to BigQuery Template -> BigQuery Table
I have gotten this system working with a non-nested JSON message that is constructed like this
json_dict = {"instrument": instrument,
"spectrum": str(self.spectrum),
"spectrum_creation_time": self.creation_dt.timestamp(),
"messenger_creation_time": time.time()}
return json.dumps(json_dict)
The table in BigQuery, which successfully stores this data has the following schema:
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Labels
----------------- ------------------------------------ ------------ ------------- ------------ ------------------- ------------------ --------
04 Sep 00:24:22 |- instrument: string 1277 81897474
|- spectrum: string
|- spectrum_creation_time: string
|- messenger_creation_time: string
Now I am trying to get this same system working with a nested JSON message, which is constructed like this:
json_dict = {'timestamp': 'AUTO',
'values': [
{'id': instrument + '.Time',
'v': time.time(),
't': time.time()},
{'id': instrument + 'Intensity',
'v': str(self.spectrum),
't': self.creation_dt.timestamp()}
]}
return json.dumps(json_dict)
I am trying to store it in a BigQuery table with the following schema:
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Labels
----------------- ------------------------------ ------------ ------------- ------------ ------------------- ------------------ --------
09 Sep 23:56:20 |- timestamp: timestamp 0 0
+- values: record (repeated)
| +- time: record
| | |- id: string
| | |- v: string
| | |- t: timestamp
| +- spectrum: record
| | |- id: string
| | |- v: string
| | |- t: timestamp
Unfortunately, when I try this approach I get the following error, which is output to an error table in BigQuery by DataFlow.
{"errors":[{"debugInfo":"","location":"values[0].v","message":"no such field: v.","reason":"invalid"}],"index":0}
null
What is the best way to solve this issue? I cannot change the nested JSON structure, because I am building a test suite and this is the required format.
I was able to solve my issue with help from @Miach Kornfield who commented on my original question. Here is my solution.
The JSON data I sent to GCP looked like
json_dict = {"timestamp": "1631554378.2955232",
'values': [
{"id":"testA.Time",
"v": "1631554378.2955232",
"t": "1631554378.2955232"},
{"id": "testA.Time.Intensity",
"v": "[1, 43, 4..]",
't': "1631554378.2955232"}
]
}
The original schema for my bigquery table was
or in text form
Schema
------------------------------
|- timestamp: timestamp
+- values: record (repeated)
| +- time: record
| | |- id: string
| | |- v: string
| | |- t: timestamp
| +- spectrum: record
| | |- id: string
| | |- v: string
| | |- t: timestamp
The schema that worked was
or in text form
Schema
------------------------------
|- timestamp: timestamp
+- values: record (repeated)
| |- id: string
| |- v: string
| |- t: timestamp
By indicating that values is of type record (repeated) it means that it is an array of structs, with the structure of the structs specified by the subcolumns. The detail (that the structure of the structs was specified by the subcolumns) was not obvious to me, and why I had so much trouble resolving this issue. I am not sure if it is possible to have a record (repeated) with heterogeneous schema.