sql, json, apache-spark-sql, databricks, databricks-sql

How can I extract dynamic number of key value pairs from a JSON column in spark SQL


I'm trying to parse out the test-N-value keys from my JSON column. I have tried using a wildcard, but the resulting column just returns NULL:

SELECT get_json_object('{"random_field": "ABC", "test-0-value":"1", "test-1-value":"2"}', '$.test-*-value');

Does anyone have any tips on how to extract these dynamic objects? The expected result is below, but remember the output has to allow a dynamic number of key-value pairs.

extracted_out
-------------
{"test-0-value":"1", "test-1-value":"2"}

Solution

  • You can convert the JSON to a map using from_json, explode the map to get key/value rows, and filter to only the keys you need.
    Then collect an array of key:value strings, concatenate the array with a comma delimiter, and use str_to_map to get a map containing only the keys you need; optionally, convert it back to a JSON string using to_json.

    Demo:

    spark.sql("""
    with src as (select '{"random_field": "ABC", "test-0-value":"1", "test-1-value":"2"}' as original_json)
    
    select s.original_json, to_json( str_to_map(concat_ws(',',collect_list(concat(key,':',value))))) result
    from src s
         lateral view  explode(from_json(s.original_json, 'map<STRING, STRING>'))  m as key, value
    where key rlike 'test-\\d+-value'  
    group by s.original_json
    """).show(100, false)
    

    Result:

    +---------------------------------------------------------------+---------------------------------------+
    |original_json                                                  |result                                 |
    +---------------------------------------------------------------+---------------------------------------+
    |{"random_field": "ABC", "test-0-value":"1", "test-1-value":"2"}|{"test-0-value":"1","test-1-value":"2"}|
    +---------------------------------------------------------------+---------------------------------------+
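
    To see what the intermediate step produces, here is a minimal sketch (same source JSON) of the rows coming out of the lateral view explode, before filtering and grouping:

    spark.sql("""
    with src as (select '{"random_field": "ABC", "test-0-value":"1", "test-1-value":"2"}' as original_json)

    select key, value
    from src s
         lateral view explode(from_json(s.original_json, 'map<STRING, STRING>')) m as key, value
    """).show(100, false)

    This should return one row per map entry: (random_field, ABC), (test-0-value, 1) and (test-1-value, 2). The rlike filter then drops the random_field row, and str_to_map reassembles the surviving pairs (its default pair delimiter is ',' and its default key/value delimiter is ':').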
    

    Update:

    Found a much better solution using map_filter (a higher-order function available in Spark SQL 2.4 and later):

    spark.sql("""
    with src as (select '{"random_field": "ABC", "test-0-value":"1", "test-1-value":"2"}' as original_json)
    
    select s.original_json, to_json( map_filter(from_json(s.original_json, 'map<STRING, STRING>'), (k, v) -> k rlike 'test-\\d+-value')) result
    from src s
    """).show(100, false)