I got a table with 1 column which is a string type, but the inside looks like an json type.
The value looks like this
'old_id' column with value
[{"name":"Entitas Penugasan","id":"6415","value":"HIJRA"},
{"name":"Function","id":"10594","value":"People & Culture"},
{"name":"Unit","id":"10595","value":"Organization Development"},
{"name":"Tribe","id":"10602","value":"Shared Service"}
]
'new_id' column with value
[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
{"name":"Function","id":"10594","value":"Finance"},
{"name":"Unit","id":"10595","value":"Finance Operations"},
{"name":"Tribe","id":"10602","value":"Commercial"}
]
I need sql athena query to make columns old_name, old_id, old_value, new_name, new_id, new_value from those json columns
I've tried using
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value
but it only generates 1 row, even though in the columns, it shows 4 'array'
The query should generate 4 rows, looks like below
id | old_name | new_name | old_id | new_id | old_value | new_value |
---|---|---|---|---|---|---|
A | Entitas Penugasan | Entitas Penugasan | 6415 | 6415 | HIJRA | AFS |
A | Function | Function | 10594 | 10594 | People & Culture | Finance |
A | Unit | Unit | 10595 | 10595 | Organization Development | Finance Operations |
A | Tribe | Tribe | 10602 | 10602 | Shared Service | Commercial |
Is there any way to do it in SQL Athena?
Edit: I did a little progress with the query below
with raw_data as(
select id, user_id, old_custom_fields, new_custom_fields
from my_table
where
-- new_custom_fields <> '' and new_custom_fields<> 'None' and new_custom_fields is not null and
id in (A)
),
splitted_data as (
SELECT id, user_id,
split(old_custom_fields, '},{') AS old_custom_field_id,
split(new_custom_fields, '},{') AS new_custom_field_id
FROM my_table
),
old_custom_field_id_unnest as (
SELECT
*
from splitted_data
CROSS JOIN UNNEST(old_custom_field_id) AS t (_old_custom_fields)
),
new_custom_field_id_unnest as (
SELECT
*
from splitted_data
CROSS JOIN UNNEST(new_custom_field_id) AS t (_new_custom_fields)
),
old_custom_field_cleaned as (
select id, old_custom_field_id,
REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value
from old_custom_field_id_unnest
),
new_custom_field_cleaned as (
select id, new_custom_field_id,
REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value
from new_custom_field_id_unnest
)
select oc.id, old_name, new_name,
old_id,new_id,
old_value,new_value
from old_custom_field_cleaned oc
join new_custom_field_cleaned nc on oc.id = nc.id
But this leads to duplicated rows, now I have 16 rows due to the join, still needs help in removing the unneeded rows
Your data does not just looks like JSON, the sample data is JSON, so process it as one. Depending on Presto/Trino version which is used the actual handling can be different but the common part would always be - parse JSON and cast it to array
of some type and then unnest. For example you can use ROW(name varchar, id varchar, value varchar)
(alternatives can be just JSON
or MAP(varchar, varchar)
or MAP(varchar, JSON)
depending on the case):
-- sample data
WITH dataset(old_id, new_id) AS (
VALUES
('[{"id":"6415","value":"HIJRA", "name":"Entitas Penugasan"},
{"name":"Function","id":"10594","value":"People & Culture"},
{"name":"Unit","id":"10595","value":"Organization Development"},
{"name":"Tribe","id":"10602","value":"Shared Service"}
]',
'[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
{"name":"Function","id":"10594","value":"Finance"},
{"name":"Unit","id":"10595","value":"Finance Operations"},
{"name":"Tribe","id":"10602","value":"Commercial"}
]'
)
)
-- query
SELECT t.*
FROM dataset,
unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar))),
cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(old_name, old_id, old_value, new_name, new_id, new_value); -- maybe as t(old, new) depending on engine and select t.old.name as old_name, ...
Output:
old_name | old_id | old_value | new_name | new_id | new_value |
---|---|---|---|---|---|
Entitas Penugasan | 6415 | HIJRA | Entitas Penugasan | 6415 | HIJRA |
Function | 10594 | People & Culture | Function | 10594 | People & Culture |
Unit | 10595 | Organization Development | Unit | 10595 | Organization Development |
Tribe | 10602 | Shared Service | Tribe | 10602 | Shared Service |
The above code comes from assumption that arrays have "same" data in correct order. Personally I would consider joining based on the id:
-- sample data
WITH dataset(old_id, new_id) AS (
-- ..
),
-- query
old_values as (
SELECT t.*
FROM dataset,
unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
),
new_values as (
SELECT t.*
FROM dataset,
unnest(cast(json_parse(new_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
)
SELECT o.id,
o.name old_name,
o.value old_value,
n.name new_name,
n.value new_value
FROM old_values as o
full outer join new_values as n on o.id = n.id;
Output:
id | old_name | old_value | new_name | new_value |
---|---|---|---|---|
6415 | Entitas Penugasan | HIJRA | Entitas Penugasan | AFS |
10594 | Function | People & Culture | Function | Finance |
10595 | Unit | Organization Development | Unit | Finance Operations |
10602 | Tribe | Shared Service | Tribe | Commercial |