Tags: sql, json, snowflake-cloud-data-platform, data-extraction, snowsql

How to extract data from complex JSON stored in Snowflake via SnowSQL?


I have millions of JSON documents stored in a single VARIANT column of a Snowflake table. They all follow the format below, though the number of key/value pairs per JSON varies.

Could someone please give me some guidance on how to extract the data into a flat table? I'm new to working with JSON, and the combination of the inconsistent number of key/value pairs and the lack of an explicit field naming the object is confusing me.

Here is a sample JSON:

{
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit": 0.2714572,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP": 0,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP": 0,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL": 8.732743,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL": 16.13105,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip": 1.3,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor": 4.167005,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor unit": "",
  "DeviceId": "streamingdevice",
  "EventEnqueuedUtcTime": "2020-05-04T22:12:21.5310000Z",
  "EventProcessedUtcTime": "2020-05-04T22:12:35.6868329Z",
  "IoTHub": {
    "ConnectionDeviceGenerationId": "637199801617320690",
    "ConnectionDeviceId": "streamingdevice",
    "CorrelationId": null,
    "EnqueuedTime": "2020-05-04T22:12:21.0000000",
    "MessageId": null,
    "StreamId": null
  },
  "PartitionId": 1,
  "Timestamp": "2019-10-30 13:48:05.000000"
}

"Edge 93 Belgium 43-23-19 1932" is an object name; each JSON is for a single object.

"Time_1_Avg.AB2 Weight on Bit" is the reading type, essentially made up of Tag1.Tag2.

The value of each key/value pair is the reading value.

The timestamp at the bottom of the JSON is the reading time.
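As a quick illustration of that key structure, here is a minimal sketch (the key literal is copied from the sample above; the column aliases are just placeholders) of how a single key decomposes on the period delimiter using Snowflake's SPLIT_PART:

-- Minimal sketch: break one key string into its three components
SELECT
  SPLIT_PART(k, '.', 1) AS object_name,  -- 'Edge 93 Belgium 43-23-19 1932'
  SPLIT_PART(k, '.', 2) AS tag_1,        -- 'Time_1_Avg'
  SPLIT_PART(k, '.', 3) AS tag_2         -- 'AB2 Weight on Bit'
FROM (SELECT 'Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit' AS k);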

This section is not required:

  "DeviceId": "streamingdevice",
  "EventEnqueuedUtcTime": "2020-05-04T22:12:21.5310000Z",
  "EventProcessedUtcTime": "2020-05-04T22:12:35.6868329Z",
  "IoTHub": {
    "ConnectionDeviceGenerationId": "637199801617320690",
    "ConnectionDeviceId": "streamingdevice",
    "CorrelationId": null,
    "EnqueuedTime": "2020-05-04T22:12:21.0000000",
    "MessageId": null,
    "StreamId": null
  },
  "PartitionId": 1,

An ideal output for this data would be:

[screenshot: desired flat output table]

But just getting something like this would be really helpful:

[screenshot: simpler tabular output]

Thank you for your help!


Solution

  • Assuming that the required keys will always have three period-separated components, the following is one possible solution:

    • Uses the FLATTEN table function to take any VARIANT-typed column from a table (a one-row constant in this example) and explode it into multiple rows
    • Relies on the THIS column generated by FLATTEN to emit a row-constant value (Timestamp) for every exploded row
    • Uses a NOT IN filter to exclude unwanted key names
    • Uses the SPLIT function with indices to divide the extracted key into multiple columns
    SELECT
      SPLIT(KEY, '.')[0]::STRING AS "Object Name"  -- first period-separated component: object name
    , SPLIT(KEY, '.')[1]::STRING AS "Tag 1"        -- second component
    , SPLIT(KEY, '.')[2]::STRING AS "Tag 2"        -- third component
    , VALUE AS "Value"                             -- the reading value
    , THIS:Timestamp::TIMESTAMP AS "Timestamp"     -- row-constant reading time from the same JSON
    FROM TABLE(FLATTEN(PARSE_JSON('
    {
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit": 0.2714572,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP": 0,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP": 0,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement": -999.25,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement": -999.25,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality": -999.25,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow": -999.25,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK": -999.25,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction": -999.25,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps": -999.25,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time": -999.25,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL": 8.732743,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL": 16.13105,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip": 1.3,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque": -999.25,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque unit": "",
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor": 4.167005,
      "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor unit": "",
      "DeviceId": "streamingdevice",
      "EventEnqueuedUtcTime": "2020-05-04T22:12:21.5310000Z",
      "EventProcessedUtcTime": "2020-05-04T22:12:35.6868329Z",
      "IoTHub": {
        "ConnectionDeviceGenerationId": "637199801617320690",
        "ConnectionDeviceId": "streamingdevice",
        "CorrelationId": null,
        "EnqueuedTime": "2020-05-04T22:12:21.0000000",
        "MessageId": null,
        "StreamId": null
      },
      "PartitionId": 1,
      "Timestamp": "2019-10-30 13:48:05.000000"
    }
    ')))
    WHERE
      -- Exclude the metadata keys that are not sensor readings
      KEY NOT IN ('DeviceId', 'IoTHub', 'PartitionId', 'Timestamp', 'EventEnqueuedUtcTime', 'EventProcessedUtcTime');
    

    This should produce a result similar to your first screenshot:

    [screenshot: result with the key split into separate columns on the period delimiter]
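
    To apply the same pattern to the real table of JSON documents rather than a pasted constant, a LATERAL FLATTEN over the VARIANT column should work; the table and column names below (readings, json_payload) are hypothetical stand-ins for your own:

    -- Sketch, assuming a table READINGS with a VARIANT column JSON_PAYLOAD (both names hypothetical)
    SELECT
      SPLIT(f.KEY, '.')[0]::STRING          AS "Object Name"
    , SPLIT(f.KEY, '.')[1]::STRING          AS "Tag 1"
    , SPLIT(f.KEY, '.')[2]::STRING          AS "Tag 2"
    , f.VALUE                               AS "Value"
    , r.json_payload:Timestamp::TIMESTAMP   AS "Timestamp"
    FROM readings r
    , LATERAL FLATTEN(INPUT => r.json_payload) f
    WHERE f.KEY NOT IN ('DeviceId', 'IoTHub', 'PartitionId', 'Timestamp',
                        'EventEnqueuedUtcTime', 'EventProcessedUtcTime');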