Search code examples
jsondictionarygoogle-bigquerykey-value

Storing JSON data in Bigquery as nested dictionary


Consider some data:

WITH
  sequences AS (
    SELECT 0 as primary_key, 0 AS key1, 1 AS key2, 3 AS value1, 4 AS value2 UNION ALL
    SELECT 0, 1, 0, 8, 5 UNION ALL
    SELECT 0, 1, 1, 1, 2 UNION ALL
    SELECT 1, 2, 1, 1, 1 UNION ALL
    SELECT 1, 2, 3, 4, 4
  ),
  agg AS (
  SELECT
    primary_key,
    key1,
    key2,
    ARRAY_AGG((SELECT AS STRUCT f.* EXCEPT (key1, key2))) AS agg
  FROM sequences AS f
  GROUP BY
    primary_key,
    key1,
    key2
)
SELECT
  a.primary_key,
  a.key1,
  a.key2,
  ANY_VALUE(TO_JSON_STRING((agg))) AS result
FROM
  agg AS a
GROUP BY
  a.primary_key,
  a.key1,
  a.key2

This gives me data in format of: enter image description here

However for future faster retrieval; I would like to get data in format of: enter image description here

Any pointer on how I can make this happen?


Solution

  • Thank you so much for such a rich post. It is fantastic that you included everything needed to tinker with a solution. Your well described input data definitely made me want to spend time examining.

    This is one possible solution:

    WITH
      sequences AS (
        SELECT 0 as primary_key, 0 AS key1, 1 AS key2, 3 AS value1, 4 AS value2 UNION ALL
        SELECT 0, 1, 0, 8, 5 UNION ALL
        SELECT 0, 1, 1, 1, 2 UNION ALL
        SELECT 1, 2, 1, 1, 1 UNION ALL
        SELECT 1, 2, 3, 4, 4
      ),
      agg AS (
      SELECT
        primary_key,
        key1,
        key2,
        ARRAY_AGG((SELECT AS STRUCT f.* EXCEPT (key1, key2))) AS agg
      FROM sequences AS f
      GROUP BY
        primary_key,
        key1,
        key2
    )
    SELECT primary_key,
      JSON_ARRAY(
        JSON_OBJECT(CAST(key1 AS STRING), 
          JSON_OBJECT(CAST(key2 AS STRING),agg.agg[0])
        )
      ) AS result FROM agg
    

    In this example, I am using the JSON functions as described in the BigQuery documentation (ref). The solution uses the JSON_ARRAY function to create an array and then the elements of the array are nested JSON_OBJECT instances using the data from the query as the values.