Search code examples
jsongoogle-cloud-platformgoogle-bigqueryiotgoogle-dataflow

How to put nested JSON data into BigQuery table with Google Cloud Platform's dataflow's Pub/Sub Topic -> BigQuery Template


I am trying to store messages sent from an IoT device in a BigQuery table.

The cloud architecture is as follows:

Local Device -> json_message -> mqtt_client -> GC IoT device -> Device Registry -> Pub/Sub Topic -> Dataflow with Pub/Sub Topic to BigQuery Template -> BigQuery Table

I have gotten this system working with a non-nested JSON message that is constructed like this

json_dict = {"instrument": instrument,
 "spectrum": str(self.spectrum),
 "spectrum_creation_time": self.creation_dt.timestamp(),
 "messenger_creation_time": time.time()}

return json.dumps(json_dict)

The table in BigQuery, which successfully stores this data has the following schema:

   Last modified                  Schema                 Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Labels
 ----------------- ------------------------------------ ------------ ------------- ------------ ------------------- ------------------ --------
  04 Sep 00:24:22   |- instrument: string                1277         81897474
                    |- spectrum: string
                    |- spectrum_creation_time: string
                    |- messenger_creation_time: string

Now I am trying to get this same system working with a nested JSON message, which is constructed like this:

json_dict = {'timestamp': 'AUTO',
             'values': [
                        {'id': instrument + '.Time',
                         'v': time.time(),
                         't': time.time()},
                        {'id': instrument + 'Intensity',
                         'v': str(self.spectrum),
                         't': self.creation_dt.timestamp()}
                        ]}

return json.dumps(json_dict)

I am trying to store it in a BigQuery table with the following schema:

   Last modified               Schema              Total Rows   Total Bytes   Expiration   Time Partitioning   Clustered Fields   Labels
 ----------------- ------------------------------ ------------ ------------- ------------ ------------------- ------------------ --------
  09 Sep 23:56:20   |- timestamp: timestamp        0            0
                    +- values: record (repeated)
                    |  +- time: record
                    |  |  |- id: string
                    |  |  |- v: string
                    |  |  |- t: timestamp
                    |  +- spectrum: record
                    |  |  |- id: string
                    |  |  |- v: string
                    |  |  |- t: timestamp

Unfortunately, when I try this approach I get the following error, which is output to an error table in BigQuery by DataFlow.

{"errors":[{"debugInfo":"","location":"values[0].v","message":"no such field: v.","reason":"invalid"}],"index":0}
null

What is the best way to solve this issue? I cannot change the nested JSON structure, because I am building a test suite and this is the required format.


Solution

  • I was able to solve my issue with help from @Miach Kornfield who commented on my original question. Here is my solution.

    The JSON data I sent to GCP looked like

    json_dict = {"timestamp": "1631554378.2955232",
                 'values': [
                            {"id":"testA.Time",
                             "v": "1631554378.2955232",
                             "t": "1631554378.2955232"},
                            {"id": "testA.Time.Intensity",
                             "v": "[1, 43, 4..]",
                             't': "1631554378.2955232"}
                             ]
                }
    

    The original schema for my bigquery table was

    Original schema for bigquery

    or in text form

                Schema
    ------------------------------
     |- timestamp: timestamp
     +- values: record (repeated)
     |  +- time: record
     |  |  |- id: string
     |  |  |- v: string
     |  |  |- t: timestamp
     |  +- spectrum: record
     |  |  |- id: string
     |  |  |- v: string
     |  |  |- t: timestamp
    

    The schema that worked was

    schema that worked

    or in text form

                Schema
    ------------------------------
     |- timestamp: timestamp
     +- values: record (repeated)
     |  |- id: string
     |  |- v: string
     |  |- t: timestamp
    
    

    By indicating that values is of type record (repeated) it means that it is an array of structs, with the structure of the structs specified by the subcolumns. The detail (that the structure of the structs was specified by the subcolumns) was not obvious to me, and why I had so much trouble resolving this issue. I am not sure if it is possible to have a record (repeated) with heterogeneous schema.