Search code examples
sqlhadoophiveimpala

Hadoop - Hive - Impala - rewrite a query for performance


I have 2 tables with below columns

Table1

col1   col2   col3     val
11     221    38       10
null   90     null     989
78     90     null     77

table2

col1   col2   col3  
12     221    78
23     null   67 
78     90     null

I want to join these 2 tables first on col1 if values matched then stop if not join on col2 if matches stop else join on col3 and populate val if any of column matches else null and whichever columns matching then populate that column in matchingcol column. So, the output should look like this:

col1   col2   col3     val     matchingcol
11     221    38       10      col2
null   90     null     null    null
78     90     null     77      col1

I was able to do this using below query, but the performance is very slow. Please let me know if there is any better way of writing below for faster performance

select *
from table1 t1 left join
     table2 t2_1
     on t2_1.col1 = t1.col1 left join
     table2 t2_2
     on t2_2.col2 = t1.col2 and t2_1.col1 
     left join table2 t2_3 on t2_3.col3 = t1.col3 and t2_2.col2 is null

ps: I asked same question before but there was no better answer


Solution

  • What you describe is:

    select t1.col1, t1.col2, t1.col3, 
           (case when t2_1.col1 is not null or t2_2.col1 is not null or t2_3.col1 is not null then t1.val end) as val
           (case when t2_1.col1 is not null then 'col1'
                 when t2_2.col2 is not null then 'col2'
                 when t2_3.col3 is not null then 'col3'
            end) as matching
    from table1 t1 left join
         table2 t2_1
         on t2_1.col1 = t1.col1 left join
         table2 t2_2
         on t2_2.col2 = t1.col2 and t2_1.col1 is null left join
         table2 t2_3
         on t2_3.col3 = t1.col3 and t2_2.col2 is null;
    

    This is probably the best approach.