Search code examples
sqljsonhivehiveql

Hive - Reformat data structure


So I have a sample of Hive data:

Customer xx_var yy_var branchflow
{"customer_no":"239230293892839892","acct":["2324325","23425345"]} 23 3 [{"acctno":"2324325","value":[1,2,3,4,5,6,6,6,4]},{"acctno":"23425345","value":[1,2,3,4,5,6,6,6,99,4]}]

And I want to transform it into something like this:

Customer_no acct xx_var yy_var branchflow
239230293892839892 2324325 23 3 [1,2,3,4,5,6,6,6,4]
239230293892839892 23425345 23 3 [1,2,3,4,5,6,6,6,99,4]

I have tried using this query, but getting the wrong output format.

SELECT 
    customer.customer_no,
    acct,
    xx_var,
    yy_var,
    bi_acctno,
    values_bi
FROM
    struct_test 
LATERAL VIEW explode(customer.acct) acct AS acctno
LATERAL VIEW explode(brancflow.acctno) bia as bi_acctno
LATERAL VIEW explode(brancflow.value) biv as values_bi
WHERE bi_acctno = acctno

Does anyone know how to approach this problem?


Solution

  • Use json_tuple to extract JSON elements. In case of array, it returns it also as string: remove square brackets, split and explode. See comments in the demo code.

    Demo:

    with mytable as (--demo data, use your table instead of this CTE
    select '{"customer_no":"239230293892839892","acct":["2324325","23425345"]}' as customer,    
           23 xx_var,   3 yy_var,   
           '[{"acctno":"2324325","value":[1,2,3,4,5,6,6,6,4]},{"acctno":"23425345","value":[1,2,3,4,5,6,6,6,99,4]}]' branchflow
    )
    
    select c.customer_no, 
           a.acct,  
           t.xx_var,    t.yy_var, 
           get_json_object(b.acct_branchflow,'$.value') value
      from mytable t
           --extract customer_no and acct array
           lateral view json_tuple(t.customer, 'customer_no', 'acct') c as customer_no, accts
           --remove [] and " and explode array of acct
           lateral view explode(split(regexp_replace(c.accts,'^\\[|"|\\]$',''),',')) a as acct
           --remove [] and explode array of json
           lateral view explode(split(regexp_replace(t.branchflow,'^\\[|\\]$',''),'(?<=\\}),(?=\\{)')) b as acct_branchflow
    --this will remove duplicates after lateral view: need only matching acct
     where get_json_object(b.acct_branchflow,'$.acctno') = a.acct
    

    Result:

    customer_no         acct        xx_var  yy_var  value
    239230293892839892  2324325     23      3       [1,2,3,4,5,6,6,6,4]
    239230293892839892  23425345    23      3       [1,2,3,4,5,6,6,6,99,4]