I have a hive table with the following schema:
int key1 # this is unique
array<int> key2_list
Now I want to merge records here if their key2_lists have any common element(s). For example, if record A has (10, [1,3,5]) and B has (12, [1,2,4]), I want to merge it as ([10,12], [1,2,3,4,5]) or ([1,2,3,4,5]).
If it is easier, the input table can have the following schema instead:
int key1
int key2
I'd prefer to do this through Hive or SparkSQL.
For your second table definition you can run something like
select collect_list(key1) from yourtable group by key2
Or for your first table definition
select collect_list(key1) from
(select key1, mykey FROM yourtable LATERAL VIEW explode(key2_list) a AS mykey) t
group by mykey