sql amazon-web-services amazon-athena presto trino

How to convert long JSON string into seperate columns via a query

I have a table with one of the columns having a long string of data in what looks like JSON format. I would like to break this data into additional columns via a query. This is in Athena.

Example of the data and its format in said column:

Site	Name	Server	Date	Data	Load_DT
stuff	Stuff	stuff	2023-10-05	{"List": [{"cpeIndex": "47392", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47395", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "0xx.xx.xx.xx.xx", "rpdMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47397", "cpeIp": "x110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}]}	2023-10-06

There are other columns that share the same data and as such, the "cpeIndex" field here would be the start of a new row with its accomponying data and the other existing columns would just carry over the same data to the other rows. So in the example above, there would be three rows.

So at the end of the day, the table after executing a query on it should return results such as

Site	Name	Server	Date	cpeIndex	cpeIP	cpestatus	macaddress	apemacaddress	Load_dt
Data	Data	Data	Date	47392	110.100.100.1	Online	xx.xx.xx.xx.xx	xx.xx.xx.xx.xx	2023-10-05
Data	Data	Data	Date	47395	110.100.100.1	Online	xx.xx.xx.xx.xx	xx.xx.xx.xx.xx	2023-10-05
Data	Data	Data	Date	47397	110.100.100.1	Online	xx.xx.xx.xx.xx	xx.xx.xx.xx.xx	2023-10-05

Solution

You can process it using json functions, casting to array of map and unnest:

-- sample data
with dataset(data) as (
    values ('{"List": [{"cpeIndex": "47392", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47395", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "0xx.xx.xx.xx.xx", "rpdMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47397", "cpeIp": "x110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}]}')
)

-- query
select m['cpeIndex'] cpeIndex,
       m['cpeIp'] cpeIP,
       m['cpeStatus'] cpeIndex,
       m['macAddress'] macaddress,
       try(m['apeMacAddress']) apemacaddress, -- one JSON object is missing this prop
       m['rpdUsrfPortnum'] rpdUSrfportnum
from dataset,
    unnest(cast(json_extract(data, '$.List') as array(map(varchar, json)))) as t(m);

If not all keys present in all objects you can also cast to 'json':

-- query
select json_extract_scalar(m, '$.cpeIndex') cpeIndex,
       json_extract_scalar(m, '$.cpeIp') cpeIP,
       json_extract_scalar(m, '$.cpeStatus') cpeIndex,
       json_extract_scalar(m, '$.macAddress') macaddress,
       json_extract_scalar(m, '$.apeMacAddress') apemacaddress,
       json_extract_scalar(m, '$.rpdUsrfPortnum') rpdUSrfportnum
from dataset,
    unnest(cast(json_extract(data, '$.List') as array(json))) as t(m);

Output:

cpeIndex	cpeIP	cpeIndex	macaddress	apemacaddress
47392	110.110.100.1	Online	xx.xx.xx.xx.xx	xx.xx.xx.xx.xx
47395	110.110.100.1	Online	0xx.xx.xx.xx.xx	NULL
47397	x110.110.100.1	Online	xx.xx.xx.xx.xx	xx.xx.xx.xx.xx