I'm trying to convert a string array to int array by keeping the original order here is a sample of what my data looks like:
id attribut string_array
id1 attribut1, 10283:990000 ["10283","990000"]
id2 attribut2, 10283:36741000 ["10283","36741000"]
id3 attribut3, 10283:37871000 ["10283","37871000"]
id4 attribut4, 3215:90451000 ["3215","90451000"]
and here's how i convert the field "string_array" into an array of integers
select
id,
attribut,
string_array,
collect_list(cast(array_explode as int)),
from table
lateral view outer explode(string_array) r as array_explode
it gives me:
id attribut string_array int_array
id1 attribut1,10283:990000 ["10283","990000"] [990000,10283]
id2 attribut2,10283:36741000 ["10283","36741000"] [10283,36741000]
id3 attribut3,10283:37871000 ["10283","37871000"] [37871000,10283]
id4 attribut4,3215:90451000 ["3215","90451000"] [90451000,3215]
As you can see, the order in "string array" has not been preserved in "int_array" and I need it to be exactly the same as in "string_array". anyone know how to achieve this ?
Any help would be much appreciated
For Hive: Use posexplode, in a subquery before collect_list do distribute by id sort by position
select
id,
attribut,
string_array,
collect_list(cast(element as int)),
from
(select *
from table t
lateral view outer posexplode(string_array) e as pos,element
distribute by t.id, attribut, string_array -- distribute by group key
sort by pos -- sort by initial position
) t
group by id, attribut, string_array
Another way is to extract substring from your attributes and split without exploding (as you asked in the comment)
select split(regexp_extract(attribut, '[^,]+,(.*)$',1),':')
Regexp '[^,]+,(.*)$' means:
[^,]+
- not a comma 1+ times
,
- comma
(.*)$
- everything else in catpturing group 1 after comma till the end of the string
Demo:
select split(regexp_extract('attribut3,10283:37871000', '[^,]+,(.*)$',1),':')
Result:
["10283","37871000"]