Search code examples
stringhadoopjoinhiveimpala

Efficient way to join big table by sub strings using hive/ impala or other ways


I have 2 tables tabl1:

+-------+--------+--------+----------+
| att1  |  att2  | att3   | att4     |
+-------+--------+--------+----------+
|  abcd | ava012 | df012f | afsdaldf |
.......

and tabl2:

+----+
| val|
+----+
| 012|
...

tabl2 contains numbers that can be substring in one or more of the 4 columns of tabl1. The both tables are big tables contains millions of records. I tried to concatenate tabl1 columns and search in it but the query never ends. Is there an efficient way to do it. Maybe to transform the whole table into one txt file and search in it? Following also this question Here are some examples of my trials (both in Hive):

SELECT a.*, b.*
from tabl1 a, tabl2 b
where  
instr (
concat ( (cast (a.att1 as string), (cast (a.att2 as string), 
(cast (a.att3 as string), (cast (a.att4 as string) ) , (cast (b.val as string) ) ) > 0

or

  SELECT a.*, b.*
    from tabl1 a, tabl2 b
    where  
    concat ( (cast (a.att1 as string), (cast (a.att2 as string), 
(cast (a.att3 as string), (cast (a.att4 as string) ) 
like  concat ('%',(cast (b.val as string),'%')

There are some more with REGEX but endless runtime...


Solution

  • select  *
    
    from           (select  *
                    from    tabl1 t1
                            lateral view explode(split(regexp_replace(trim(regexp_replace(concat_ws(',',att1,att2,att3,att4),'\\D+',' ')),'(?<=^| )(?<token>.*?) (?=.*(?<= )\\k<token>(?= |$))',''),' ')) e as val
                    ) t1
    
            join    tabl2 t2
    
            on      t2.val = 
                    t1.val