Search code examples
apache-spark-sqlhiveql

Merging records by 1+ common elements


I have a hive table with the following schema:

int key1 # this is unique
array<int> key2_list

Now I want to merge records here if their key2_lists have any common element(s). For example, if record A has (10, [1,3,5]) and B has (12, [1,2,4]), I want to merge it as ([10,12], [1,2,3,4,5]) or ([1,2,3,4,5]).

If it is easier, the input table can have the following schema instead:

int key1
int key2

I'd prefer to do this through Hive or SparkSQL.


Solution

  • For your second table definition you can run something like

    select collect_list(key1) from yourtable group by key2
    

    Or for your first table definition

    select collect_list(key1) from 
    (select key1,  mykey FROM yourtable LATERAL VIEW explode(key2_list) a AS mykey) t
    group by mykey