Search code examples
sqljsongoogle-bigqueryprestoamazon-athena

How do I Unnest varchar to json in Athena


I am crawling data from Google Big Query and staging them into Athena. One of the columns crawled as string, contains json :

{
   "key": "Category",
   "value": {
        "string_value": "something"
}

I need to unnest these and flatten them to be able to use them in a query. I require key and string value (so in my query it will be where Category = something

I have tried the following :

WITH dataset AS (
  SELECT cast(json_column as json) as json_column
  from "thedatabase"
  LIMIT 10
)
SELECT
    json_extract_scalar(json_column, '$.value.string_value') AS string_value
FROM dataset

which is returning null. Casting the json_column as json adds \ into them :

"[{\"key\":\"something\",\"value\":{\"string_value\":\"app\"}}

If I use replace on the json, it doesn't allow me as it's not a varchar object.

So how do I extract the values from the some_column field?


Solution

    1. Presto's json_extract_scalar actually supports extracting just from the varchar (string) value :
    -- sample data
    WITH dataset(json_column) AS (
     values ('{
       "key": "Category",
       "value": {
            "string_value": "something"
    }}')
    )
    
    --query
    SELECT
        json_extract_scalar(json_column, '$.value.string_value') AS string_value
    FROM dataset;
    

    Output:

    string_value
    something
    1. Casting to json will encode data as json (in case of string you will get a double encoded one), not parse it, use json_parse (in this particular case it is not needed, but there are cases when you will want to use it):
    -- query
    SELECT
        json_extract_scalar(json_parse(json_column), '$.value.string_value') AS string_value
    FROM dataset;