Search code examples
sqlarrayshivehiveqlexplode

Multiple lateral view and explode in hive produce rows duplication


I am working with 4 columns

Ref_No Currency Amount Tag
EBDR001 usd^usd^usd^usd^ 240^300^210^500^ DBC^ODA^ICA^DRA

I Want data in this format

Ref_No Currency Amount Tag
EBDR001 usd 240 DBC
EBDR001 usd 300 ODA
EBDR001 usd 210 ICA
EBDR001 usd 500 DRA

The result I am getting

Ref_No Currency Amount Tag
EBDR001 usd 240 DBC
EBDR001 usd 240 DBC
EBDR001 usd 240 DBC
EBDR001 usd 240 DBC
EBDR001 usd 300 ODA
EBDR001 usd 300 ODA
EBDR001 usd 300 ODA
EBDR001 usd 300 ODA
EBDR001 usd 210 ICA
EBDR001 usd 210 ICA
EBDR001 usd 210 ICA
EBDR001 usd 210 ICA
EBDR001 usd 500 DRA
EBDR001 usd 500 DRA
EBDR001 usd 500 DRA
EBDR001 usd 500 DRA

Like this there are over thousand of rows with different Ref_No.

The query I am using is

select Ref,ccy,amt,tag_1 from table1
lateral view explode(split(ccy,"\\^")) myTable12 as ccy
lateral view explode(split(amt,"\\^")) myTable13 as amt
lateral view explode(split(tag_1 ,"\\^")) myTable14 as tag_1



Solution

  • Multiple lateral view produce Cartesian product. The solution is to use single lateral view posexplode, split other columns to get arrays, then use position to address other values.

    Some columns contain extra ^ delimiter at the end, remove it before splitting.

    Demo:

    with table1 as (--Data example, use your table instead
    select 'EBDR001' Ref_No, 'usd^usd^usd^usd^' Currency, '240^300^210^500^' Amount, 'DBC^ODA^ICA^DRA' Tag
    )
    
    select Ref_No, 
           c.ccy, 
           split(t1.Amount,"\\^")[c.pos] amt, 
           split(t1.Tag,"\\^")[c.pos]    tag
      from ( --Remove extra delimiter at the end
            select Ref_No, 
                   regexp_replace(Currency,'\\^$','') Currency, 
                   regexp_replace(Amount,'\\^$','')   Amount,
                   regexp_replace(Tag,'\\^$','')      Tag
              from table1
           ) t1
           lateral view posexplode(split(t1.Currency,"\\^")) c as pos, ccy 
    

    Result:

    ref_no  c.ccy amt   tag
    EBDR001 usd   240   DBC
    EBDR001 usd   300   ODA
    EBDR001 usd   210   ICA
    EBDR001 usd   500   DRA
    

    Possible alternatives:

    You can use posexplode instead of explode in three lateral views and WHERE clause to match positions of three lateral views, see https://stackoverflow.com/a/59497463/2700344

    Also you can use three CTEs in which you posexplode each array, then LEFT JOIN them with main table ON Ref and Pos, similar solution is here: https://stackoverflow.com/a/56244552/2700344.