Search code examples
jsonhadoophivehiveql

How to query struct array with Hive (get_json_object)?


I store the following JSON objects in a Hive table:

{
  "main_id": "qwert",
  "features": [
    {
      "scope": "scope1",
      "name": "foo",
      "value": "ab12345",
      "age": 50,
      "somelist": ["abcde","fghij"]
    },
    {
      "scope": "scope2",
      "name": "bar",
      "value": "cd67890"
    },
    {
      "scope": "scope3",
      "name": "baz",
      "value": [
        "A",
        "B",
        "C"
      ]
    }
  ]
}

"features" is an array of varying length, i.e. all objects are optional. The objects have arbitrary elements, but all of them contain "scope", "name" and "value".

This is the Hive table I created:

CREATE TABLE tbl(
main_id STRING,features array<struct<scope:STRING,name:STRING,value:array<STRING>,age:INT,somelist:array<STRING>>>
)

I need a Hive query that returns the main_id and the value of the struct with the name "baz", i.e.,

main_id baz_value
qwert ["A","B","C"]

My problem is that the Hive UDF "get_json_object" supports only a limited version of JSONPath. It does not support a path like get_json_object(features, '$.features[?(@.name='baz')]').

How can query the wanted result with Hive? Is it maybe easier with another Hive table structure?


Solution

  • I found a solution for this:

    Use the Hive explode UDTF to explode the struct array, i.e., create a second (temporary) table with one record for each struct in the array "features".

    CREATE TABLE tbl_exploded as
    select main_id, 
    f.name as f_name,
    f.value as f_value
    from tbl
    LATERAL VIEW explode(features) exploded_table as f
    -- optionally filter here instead of in 2nd query:
    -- where f.name = 'baz'; 
    

    The result of this is:

    qwert, foo, ab12345
    qwert, bar, cd67890
    qwert, baz, ["A","B","C"]
    

    Now you can select the main_id and value like this:

    select main_id, f_value from tbl_exploded where f_name = 'baz';