Search code examples
sqlhivehiveql

Hive: Find unique values if condition met between two tables


I have two tables. Table 1 has all the unique places (30 rows) that I am interested in:

places
japan
china
india
...

Table 2 has all the information for IDs, places visited and date.

id places date
10001 japan 20210204
10001 australia 20210204
10001 china 20210204
10001 argentina 20210205
10002 spain 20210204
10002 india 20210204
10002 china 20210205
10003 argentina 20210204
10003 portugal 20210204

What I am interested to get is:

  • For a specific date (let's say 20210204)
  • Find all unique IDs from Table 2 that has visited at least one of the places from Table 1
  • Save those unique IDs to a temp table.

Here is what I have tried:

create temporary table imp.unique_ids_tmp
as select distinct(final.id) from
(select t2.id
from table2 as t2
where t2.date = '20210204'
and t2.places in 
(select * from table1)) final;

I am struggling to incorporate the "at least one" logic such that once a satisfying id is found, it stops looking at those id records.


Solution

  • Use left semi join (implements uncorrelated EXISTS in efficient way), it will filter only records joined, after that apply distinct:

    create temporary table imp.unique_ids_tmp as
    select distinct t2.id --distinct is not a function, do not need ()
      from table2 t2
           left semi join table1 t1 on t2.places = t1.places
     where t2.date = '20210204'
    ;
    

    At "least once" condition will be satisfied: IDs which have no joined records will not present in the dataset.

    Another way is to use correlated EXISTS:

    create temporary table imp.unique_ids_tmp as
    select distinct t2.id --distinct is not a function, do not need ()
      from table2 t2
     where t2.date = '20210204' 
       --this condition is true as soon as one match is found
       and exists (select 1 from table1 t1 where t2.places = t1.places)
    ;
    

    IN also will work.

    Correllated EXIST looks close to "once a satisfying id is found, it stops looking at those id records", but all these methods are implemented using JOIN in Hive. Execute EXPLAIN and you will see, it will be the same plan generated, though it depends on implementation in your version. Potentially EXISTS can be faster because do not need to check all records in the subquery. Taking into account that your table1 with 30 rows is small enough to fit in memory, MAP-JOIN (set hive.auto.convert.join=true;) will give you the best performance.

    One more fast method using array or IN(static_list). It can be used for small and static arrays. Ordered array may give you better performance:

    select distinct t2.id --distinct is not a function, do not need ()
      from table2 t2
     where t2.date = '20210204'
           and array_contains(array('australia', 'china', 'japan', ... ), t2.places)
           --OR use t2.places IN ('australia', 'china', 'japan', ... )
    

    Why this method is faster: Because no need to start mapper and calculate splits to read table from hdfs, only table2 will be read. The drawback is that list of values is static. On the other hand, you can pass whole list as a parameter, see here.