Search code examples
sqlamazon-web-servicesamazon-athenaprestotrino

How to convert long JSON string into seperate columns via a query


I have a table with one of the columns having a long string of data in what looks like JSON format. I would like to break this data into additional columns via a query. This is in Athena.

Example of the data and its format in said column:

Site Name Server Date Data Load_DT
stuff Stuff stuff 2023-10-05 {"List": [{"cpeIndex": "47392", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47395", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "0xx.xx.xx.xx.xx", "rpdMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47397", "cpeIp": "x110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}]} 2023-10-06

There are other columns that share the same data and as such, the "cpeIndex" field here would be the start of a new row with its accomponying data and the other existing columns would just carry over the same data to the other rows. So in the example above, there would be three rows.

So at the end of the day, the table after executing a query on it should return results such as

Site Name Server Date cpeIndex cpeIP cpestatus macaddress apemacaddress rpdUSrfportnum Load_dt
Data Data Data Date 47392 110.100.100.1 Online xx.xx.xx.xx.xx xx.xx.xx.xx.xx 0 2023-10-05
Data Data Data Date 47395 110.100.100.1 Online xx.xx.xx.xx.xx xx.xx.xx.xx.xx 0 2023-10-05
Data Data Data Date 47397 110.100.100.1 Online xx.xx.xx.xx.xx xx.xx.xx.xx.xx 0 2023-10-05

Solution

  • You can process it using json functions, casting to array of map and unnest:

    -- sample data
    with dataset(data) as (
        values ('{"List": [{"cpeIndex": "47392", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47395", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "0xx.xx.xx.xx.xx", "rpdMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47397", "cpeIp": "x110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}]}')
    )
    
    -- query
    select m['cpeIndex'] cpeIndex,
           m['cpeIp'] cpeIP,
           m['cpeStatus'] cpeIndex,
           m['macAddress'] macaddress,
           try(m['apeMacAddress']) apemacaddress, -- one JSON object is missing this prop
           m['rpdUsrfPortnum'] rpdUSrfportnum
    from dataset,
        unnest(cast(json_extract(data, '$.List') as array(map(varchar, json)))) as t(m);
    

    If not all keys present in all objects you can also cast to 'json':

    -- query
    select json_extract_scalar(m, '$.cpeIndex') cpeIndex,
           json_extract_scalar(m, '$.cpeIp') cpeIP,
           json_extract_scalar(m, '$.cpeStatus') cpeIndex,
           json_extract_scalar(m, '$.macAddress') macaddress,
           json_extract_scalar(m, '$.apeMacAddress') apemacaddress,
           json_extract_scalar(m, '$.rpdUsrfPortnum') rpdUSrfportnum
    from dataset,
        unnest(cast(json_extract(data, '$.List') as array(json))) as t(m);
    

    Output:

    cpeIndex cpeIP cpeIndex macaddress apemacaddress rpdUSrfportnum
    47392 110.110.100.1 Online xx.xx.xx.xx.xx xx.xx.xx.xx.xx 0
    47395 110.110.100.1 Online 0xx.xx.xx.xx.xx NULL 0
    47397 x110.110.100.1 Online xx.xx.xx.xx.xx xx.xx.xx.xx.xx 0