Search code examples
arrayshivetype-conversionhiveqlcollect

Preserve order while converting string array into int array in hive


I'm trying to convert a string array to int array by keeping the original order here is a sample of what my data looks like:

id       attribut                       string_array
id1      attribut1, 10283:990000       ["10283","990000"]
id2      attribut2, 10283:36741000     ["10283","36741000"]
id3      attribut3, 10283:37871000     ["10283","37871000"]
id4      attribut4, 3215:90451000      ["3215","90451000"]

and here's how i convert the field "string_array" into an array of integers

select  
id, 
attribut,
string_array,
collect_list(cast(array_explode as int)),
from table
lateral view outer explode(string_array) r as array_explode

it gives me:

id       attribut                        string_array              int_array
id1      attribut1,10283:990000         ["10283","990000"]        [990000,10283]
id2      attribut2,10283:36741000       ["10283","36741000"]      [10283,36741000]
id3      attribut3,10283:37871000       ["10283","37871000"]      [37871000,10283]
id4      attribut4,3215:90451000        ["3215","90451000"]       [90451000,3215]

As you can see, the order in "string array" has not been preserved in "int_array" and I need it to be exactly the same as in "string_array". anyone know how to achieve this ?

Any help would be much appreciated


Solution

  • For Hive: Use posexplode, in a subquery before collect_list do distribute by id sort by position

    select  
    id, 
    attribut,
    string_array,
    collect_list(cast(element as int)),
    from
    (select * 
      from table t
           lateral view outer posexplode(string_array) e as pos,element 
      distribute by t.id, attribut, string_array -- distribute by group key
      sort by pos        -- sort by initial position
    ) t
    group by id, attribut, string_array
    

    Another way is to extract substring from your attributes and split without exploding (as you asked in the comment)

    select split(regexp_extract(attribut, '[^,]+,(.*)$',1),':')
    

    Regexp '[^,]+,(.*)$' means:

    [^,]+ - not a comma 1+ times , - comma (.*)$ - everything else in catpturing group 1 after comma till the end of the string

    Demo:

    select split(regexp_extract('attribut3,10283:37871000', '[^,]+,(.*)$',1),':')
    

    Result:

    ["10283","37871000"]