Search code examples
google-bigqueryuser-defined-functionsjson-extract

BigQuery UDF user defined functions return types


I am using the following SQL (from another question) which contains temporary functions.

create temp function  extract_keys(input string) returns array<string> language js as """
  return Object.keys(JSON.parse(input));
  """;
create temp function  extract_values(input string) returns array<string> language js as """
  return Object.values(JSON.parse(input));
  """;
create temp function extract_all_leaves(input string) returns string language js as '''
  function flattenObj(obj, parent = '', res = {}){
    for(let key in obj){
        let propName = parent ? parent + '.' + key : key;
        if(typeof obj[key] == 'object'){
            flattenObj(obj[key], propName, res);
        } else {
            res[propName] = obj[key];
        }
    }
    return JSON.stringify(res);
  }
  return flattenObj(JSON.parse(input));
  ''';
select col || replace(replace(key, 'value', ''), '.', '-') as col, value, 
from your_table,
unnest([struct(extract_all_leaves(data) as json)]),
unnest(extract_keys(json)) key with offset
join unnest(extract_values(json)) value with offset
using(offset)    

I want to save the above query as a view, but I cannot include the temporary functions, so I planned to define these as user-defined functions that can be called as part of the view.

When defining the functions, I'm having some trouble getting the input and output types defined correctly. Here's the three user defined functions.

CREATE OR REPLACE FUNCTION `dataset.json_extract_all_leaves`(Obj String)
RETURNS String
LANGUAGE js AS """
  function flattenObj(obj, parent = '', res = {}){
    for(let key in obj){
        let propName = parent ? parent + '.' + key : key;
        if(typeof obj[key] == 'object'){
            flattenObj(obj[key], propName, res);
        } else {
            res[propName] = obj[key];
        }
    }
    return JSON.stringify(res);
  }
  return flattenObj(JSON.parse(input));
"""
CREATE OR REPLACE FUNCTION `dataset.json_extract_keys`(input String)
RETURNS Array<String>
LANGUAGE js AS """
  return Object.keys(JSON.parse(input));
"""

finally

CREATE OR REPLACE FUNCTION `dataform.json_extract_values`(input STRING)
RETURNS Array<String>
LANGUAGE js AS """
  return Object.values(JSON.parse(input));
"""

Those three functions are created successfully, but when I come to use them in this view

WITH extract_all AS (
  select 
  id,
  field,
  created,
  key || replace(replace(key, 'value', ''), '.', '-') as key_name, value, 
  FROM `dataset.raw_keys_and_values`,
  unnest([struct(`dataset.json_extract_all_leaves`(setting_value) as json)]),
  unnest(`dataset.json_extract_keys`(json)) key with offset
  join unnest(`dataset.json_extract_values`(json)) value with offset
  using(offset)      
)


SELECT *
FROM 
extract_all

This fails with the following error

Error: Multiple errors occurred during the request. Please see the `errors` array for complete details. 1. Failed to coerce output value "{\"value\":true}" to type ARRAY<STRING>

I understand there's a mismatch somewhere between the expected return value of json_extract_values, but I can't understand if it's in the SQL or JavaScript UDF?


Solution

  • Revised Answer

    I've given the original ask another read and contrasted with some experimentation in my test data set.

    While I'm unable to reproduce the given error, I did experience related difficulty with the following line:

    unnest([struct(`dataset.json_extract_all_leaves`(setting_value) as json)]),
    

    Put simply, the function being called takes a string (presumably a stringified JSON value) and returns a similarly stringified JSON value with the result. Because UNNEST can only be used with arrays, the author surrounds the output with [struct and ] which may be the issue. Again, in an effort to yield the same result as I do below, but using the original functions, I would propose that the SQL block be updated to the following:

    create temp function  extract_keys(input string) returns array<string> language js as """
      return Object.keys(JSON.parse(input));
      """;
    create temp function  extract_values(input string) returns array<string> language js as """
      return Object.values(JSON.parse(input));
      """;
    create temp function extract_all_leaves(input string) returns string language js as '''
      function flattenObj(obj, parent = '', res = {}){
        for(let key in obj){
            let propName = parent ? parent + '.' + key : key;
            if(typeof obj[key] == 'object'){
                flattenObj(obj[key], propName, res);
            } else {
                res[propName] = obj[key];
            }
        }
        return JSON.stringify(res);
      }
      return flattenObj(JSON.parse(input));
      ''';
    WITH extract_all AS (
      select 
        id, 
        field,
        created,
        properties
        FROM 
          UNNEST([
            STRUCT<id int, field string, created DATE, properties string>(1, 'michael', DATE(2022, 5, 1), '[[{"name":"Andy","age":7},{"name":"Mark","age":5},{"name":"Courtney","age":6}], [{"name":"Austin","age":8},{"name":"Erik","age":6},{"name":"Michaela","age":6}]]'),
            STRUCT<id int, field string, created DATE, properties string>(2, 'sarah', DATE(2022, 5, 2), '[{"name":"Angela","age":9},{"name":"Ryan","age":7},{"name":"Andrew","age":7}]'),
            STRUCT<id int, field string, created DATE, properties string>(3, 'rosy', DATE(2022, 5, 3), '[{"name":"Brynn","age":4},{"name":"Cameron","age":3},{"name":"Rebecca","age":5}]')
          ])
        AS myData
    )
    
    SELECT
      id,
      field,
      created,
      key,
      value
    FROM (
      SELECT 
        *
      FROM extract_all,
      UNNEST(extract_keys(extract_all_leaves(properties))) key WITH OFFSET
      JOIN UNNEST(extract_values(extract_all_leaves(properties))) value WITH OFFSET
      USING(OFFSET)
    )
    

    Put simply - remove the extract_all_leaves line with its array casting and perform it in the offset-joined pair of keys and values, then put all that in a subquery so you can cleanly pull out just the columns you want.

    And to explicitly answer the asked question, I believe the issue is in the SQL because of the type casting in the offending line and my own inability to get it to cleanly pair with the subsequent UNNEST queries against its output.

    Original Answer

    I gather that you've got some sort of JSON object in your settings_value field and you're trying to sift out a result that shows the keys and values of that object alongside the other columns in your dataset.

    As others mentioned in the comments, this is a bit of a puzzle to figure out precisely why your query isn't working without any sample data, so happy to re-visit this if you can provide a record or two I can drop in to validate against, but here's an end-to-end that yields my guess as to what you're aiming for. In lieu of that, I've created some sample records intended to be in the same spirit of what you provided.

    Based on your use of joining by the offset, I'm supposing that you're really just wanting to see all the keys and their values, paired with the other columns. Assuming that's true, I propose using a different JavaScript function that yields an array of all the key/value pairs instead of two separate functions to yield their own arrays. It simplifies the query (and more importantly, works):

    create temp function extract_all_leaves(input string) returns string language js as r'''
      function flattenObj(obj, parent = '', res = {}){
        for(let key in obj){
            let propName = parent ? parent + '.' + key : key;
            if(typeof obj[key] == 'object'){
                flattenObj(obj[key], propName, res);
            } else {
                res[propName] = obj[key];
            }
        }
        return JSON.stringify(res);
      }
      return flattenObj(JSON.parse(input));
      ''';
    create temp function extract_key_values(input string) returns array<struct<key string, value string>> language js as r"""
        var parsed = JSON.parse(input);
        var keys = Object.keys(parsed);
        var result = [];
        for (var ii = 0; ii < keys.length; ii++) {
            var o = {key: keys[ii], value: parsed[keys[ii]]};
            result.push(o);
        }
        return result;
    """;
    WITH extract_all AS (
      select 
        id, 
        field,
        created,
        properties
        FROM 
          UNNEST([
            --STRUCT<id int, field string, created DATE, properties string>(1, 'michael', DATE(2022, 5, 1), '[[{"name":"Andy","age":7},{"name":"Mark","age":5},{"name":"Courtney","age":6}], [{"name":"Austin","age":8},{"name":"Erik","age":6},{"name":"Michaela","age":6}]]'),
            STRUCT<id int, field string, created DATE, properties string>(2, 'sarah', DATE(2022, 5, 2), '[{"name":"Angela","age":9},{"name":"Ryan","age":7},{"name":"Andrew","age":7}]'),
            STRUCT<id int, field string, created DATE, properties string>(3, 'rosy', DATE(2022, 5, 3), '[{"name":"Brynn","age":4},{"name":"Cameron","age":3},{"name":"Rebecca","age":5}]')
          ])
        AS myData
    )
    
    SELECT 
      id,
      field,
      created,
      key,
      value
    FROM (
      SELECT 
        *
      FROM extract_all
      CROSS JOIN UNNEST(extract_key_values(extract_all_leaves(properties)))
    )
    

    And I believe this yields a result more like what you're seeking:

    id field created key value
    2 sarah 2022-05-02 0.name Angela
    2 sarah 2022-05-02 0.age 9
    2 sarah 2022-05-02 1.name Ryan
    2 sarah 2022-05-02 1.age 7
    2 sarah 2022-05-02 2.name Andrew
    2 sarah 2022-05-02 2.age 7
    3 rosy 2022-05-03 0.name Brynn
    3 rosy 2022-05-03 0.age 4
    3 rosy 2022-05-03 1.name Cameron
    3 rosy 2022-05-03 1.age 3
    3 rosy 2022-05-03 2.name Rebecca
    3 rosy 2022-05-03 2.age 5

    Of course, if this isn't at all in the right place of where you're trying to get to.