Search code examples
sqljsonamazon-athenaprestotrino

Create new columns from String Columns that looks like JSON


I got a table with 1 column which is a string type, but the inside looks like an json type.

The value looks like this

'old_id' column with value

[{"name":"Entitas Penugasan","id":"6415","value":"HIJRA"},
 {"name":"Function","id":"10594","value":"People & Culture"},
 {"name":"Unit","id":"10595","value":"Organization Development"},
 {"name":"Tribe","id":"10602","value":"Shared Service"}
]

'new_id' column with value

[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
 {"name":"Function","id":"10594","value":"Finance"},
 {"name":"Unit","id":"10595","value":"Finance Operations"},
 {"name":"Tribe","id":"10602","value":"Commercial"}
]

I need sql athena query to make columns old_name, old_id, old_value, new_name, new_id, new_value from those json columns

I've tried using

REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value

but it only generates 1 row, even though in the columns, it shows 4 'array'

The query should generate 4 rows, looks like below enter image description here

id old_name new_name old_id new_id old_value new_value
A Entitas Penugasan Entitas Penugasan 6415 6415 HIJRA AFS
A Function Function 10594 10594 People & Culture Finance
A Unit Unit 10595 10595 Organization Development Finance Operations
A Tribe Tribe 10602 10602 Shared Service Commercial

Is there any way to do it in SQL Athena?


Edit: I did a little progress with the query below

with raw_data as(
select id, user_id, old_custom_fields, new_custom_fields
    from my_table
    where 
    -- new_custom_fields <> '' and new_custom_fields<> 'None' and new_custom_fields is not null and
    id in (A)
),
splitted_data as (
    SELECT id, user_id,
    split(old_custom_fields, '},{') AS old_custom_field_id,
    split(new_custom_fields, '},{') AS new_custom_field_id
  FROM my_table
),
old_custom_field_id_unnest as (
SELECT
  *
 from splitted_data
CROSS JOIN UNNEST(old_custom_field_id) AS t (_old_custom_fields)
),
new_custom_field_id_unnest as (
SELECT
  *
 from splitted_data
CROSS JOIN UNNEST(new_custom_field_id) AS t (_new_custom_fields)
),
old_custom_field_cleaned as (
    select id, old_custom_field_id,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value
    from old_custom_field_id_unnest
),
new_custom_field_cleaned as (
    select id, new_custom_field_id,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value
    from new_custom_field_id_unnest
)
select oc.id, old_name, new_name,
    old_id,new_id,
    old_value,new_value
    from old_custom_field_cleaned oc 
    join new_custom_field_cleaned nc on oc.id = nc.id

But this leads to duplicated rows, now I have 16 rows due to the join, still needs help in removing the unneeded rows enter image description here


Solution

  • Your data does not just looks like JSON, the sample data is JSON, so process it as one. Depending on Presto/Trino version which is used the actual handling can be different but the common part would always be - parse JSON and cast it to array of some type and then unnest. For example you can use ROW(name varchar, id varchar, value varchar) (alternatives can be just JSON or MAP(varchar, varchar) or MAP(varchar, JSON) depending on the case):

    -- sample data
    WITH dataset(old_id, new_id) AS (
        VALUES
            ('[{"id":"6415","value":"HIJRA", "name":"Entitas Penugasan"},
                 {"name":"Function","id":"10594","value":"People & Culture"},
                 {"name":"Unit","id":"10595","value":"Organization Development"},
                 {"name":"Tribe","id":"10602","value":"Shared Service"}
              ]',
              '[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
                 {"name":"Function","id":"10594","value":"Finance"},
                 {"name":"Unit","id":"10595","value":"Finance Operations"},
                 {"name":"Tribe","id":"10602","value":"Commercial"}
                ]'
             )
    )
    
    -- query
    SELECT t.*
    FROM dataset,
    unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar))),
        cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(old_name, old_id, old_value, new_name, new_id, new_value); -- maybe as t(old, new) depending on engine and select t.old.name as old_name, ...
    

    Output:

    old_name old_id old_value new_name new_id new_value
    Entitas Penugasan 6415 HIJRA Entitas Penugasan 6415 HIJRA
    Function 10594 People & Culture Function 10594 People & Culture
    Unit 10595 Organization Development Unit 10595 Organization Development
    Tribe 10602 Shared Service Tribe 10602 Shared Service

    The above code comes from assumption that arrays have "same" data in correct order. Personally I would consider joining based on the id:

    -- sample data
    WITH dataset(old_id, new_id) AS (
       -- ..
    ),
    
    -- query 
    old_values as (
        SELECT t.*
        FROM dataset,
        unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
    ),
    new_values as (
        SELECT t.*
        FROM dataset,
        unnest(cast(json_parse(new_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
    )
    
    SELECT o.id,
           o.name old_name,
           o.value old_value,
           n.name new_name,
           n.value new_value
    FROM old_values as o
    full outer join new_values as n on o.id = n.id;
    

    Output:

    id old_name old_value new_name new_value
    6415 Entitas Penugasan HIJRA Entitas Penugasan AFS
    10594 Function People & Culture Function Finance
    10595 Unit Organization Development Unit Finance Operations
    10602 Tribe Shared Service Tribe Commercial