I have a table with one of the columns having a long string of data in what looks like JSON format. I would like to break this data into additional columns via a query. This is in Athena.
Example of the data and its format in said column:
Site | Name | Server | Date | Data | Load_DT |
---|---|---|---|---|---|
stuff | Stuff | stuff | 2023-10-05 | {"List": [{"cpeIndex": "47392", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47395", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "0xx.xx.xx.xx.xx", "rpdMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47397", "cpeIp": "x110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}]} | 2023-10-06 |
There are other columns that share the same data and as such, the "cpeIndex" field here would be the start of a new row with its accomponying data and the other existing columns would just carry over the same data to the other rows. So in the example above, there would be three rows.
So at the end of the day, the table after executing a query on it should return results such as
Site | Name | Server | Date | cpeIndex | cpeIP | cpestatus | macaddress | apemacaddress | rpdUSrfportnum | Load_dt |
---|---|---|---|---|---|---|---|---|---|---|
Data | Data | Data | Date | 47392 | 110.100.100.1 | Online | xx.xx.xx.xx.xx | xx.xx.xx.xx.xx | 0 | 2023-10-05 |
Data | Data | Data | Date | 47395 | 110.100.100.1 | Online | xx.xx.xx.xx.xx | xx.xx.xx.xx.xx | 0 | 2023-10-05 |
Data | Data | Data | Date | 47397 | 110.100.100.1 | Online | xx.xx.xx.xx.xx | xx.xx.xx.xx.xx | 0 | 2023-10-05 |
You can process it using json functions, casting to array of map and unnest:
-- sample data
with dataset(data) as (
values ('{"List": [{"cpeIndex": "47392", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47395", "cpeIp": "110.110.100.1", "cpeStatus": "Online", "macAddress": "0xx.xx.xx.xx.xx", "rpdMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}, {"cpeIndex": "47397", "cpeIp": "x110.110.100.1", "cpeStatus": "Online", "macAddress": "xx.xx.xx.xx.xx", "apeMacAddress": "xx.xx.xx.xx.xx", "rpdUsrfPortnum": "0"}]}')
)
-- query
select m['cpeIndex'] cpeIndex,
m['cpeIp'] cpeIP,
m['cpeStatus'] cpeIndex,
m['macAddress'] macaddress,
try(m['apeMacAddress']) apemacaddress, -- one JSON object is missing this prop
m['rpdUsrfPortnum'] rpdUSrfportnum
from dataset,
unnest(cast(json_extract(data, '$.List') as array(map(varchar, json)))) as t(m);
If not all keys present in all objects you can also cast to 'json':
-- query
select json_extract_scalar(m, '$.cpeIndex') cpeIndex,
json_extract_scalar(m, '$.cpeIp') cpeIP,
json_extract_scalar(m, '$.cpeStatus') cpeIndex,
json_extract_scalar(m, '$.macAddress') macaddress,
json_extract_scalar(m, '$.apeMacAddress') apemacaddress,
json_extract_scalar(m, '$.rpdUsrfPortnum') rpdUSrfportnum
from dataset,
unnest(cast(json_extract(data, '$.List') as array(json))) as t(m);
Output:
cpeIndex | cpeIP | cpeIndex | macaddress | apemacaddress | rpdUSrfportnum |
---|---|---|---|---|---|
47392 | 110.110.100.1 | Online | xx.xx.xx.xx.xx | xx.xx.xx.xx.xx | 0 |
47395 | 110.110.100.1 | Online | 0xx.xx.xx.xx.xx | NULL | 0 |
47397 | x110.110.100.1 | Online | xx.xx.xx.xx.xx | xx.xx.xx.xx.xx | 0 |