Search code examples
sqlhadoopjoinhivehql

What's the purpose of a JOIN where no column from 2nd table is being used?


I am looking through some hive queries we are running as part of analytics on our hadoop cluster, but I am having trouble understanding one. This is the Hive QL query

SELECT 
    c_id, v_id, COUNT(DISTINCT(m_id)) AS participants, 
    cast(date_sub(current_date, ${window}) as string) as event_date 
from (
    select 
        a.c_id, a.v_id, a.user_id, 
        case 
            when c.id1 is not null and a.timestamp <= c.stitching_ts then c.id2 else a.m_id 
        end as m_id 
    from (
        select * from first
        where event_date <= cast(date_sub(current_date, ${window}) as string)
    ) a 
    join (
        select * from second
    ) b on a.c_id = b.c_id 
    left join third c 
    on a.user_id = c.id1
    ) dx 
group by c_id, v_id;

I have changed the names but otherwise this is the select statement being used to insert overwrite to another table. Regarding the join

join (
        select * from second
    ) b on a.c_id = b.c_id 

b is not used anywhere except for join condition, so is this join serving any purpose at all?
Is it for making sure that this join only has entries where c_id is present in second table? Would a where IN condition be better if thats all this is doing. Or I can just remove this join and it won't make any difference at all.

Thanks.


Solution

    1. Join (any inner, left or right) can duplicate rows if join key in joined dataset is not unique. For example if a contains single row with c_id=1 and b contains two rows with c_id=1, the result will be two rows with a.c_id=1.
    2. Join (inner) can filter rows if join key is absent in joined dataset. I believe this is what it meant to do.

    If the goal is to get only rows with keys present in both datasets(filter) and you do not want duplication, and you do not use columns from joined dataset, then better use LEFT SEMI JOIN instead of JOIN, it will work as filter only even if there are duplicated keys in joined dataset:

    left semi join (
                    select c_id from second
                   ) b on a.c_id = b.c_id 
    

    This is much safer way to filter rows only which exist in both a and b and avoid unintended duplication.

    You can replace join with WHERE IN/EXISTS, but it makes no difference, it is implemented as the same JOIN, check the EXPLAIN output and you will see the same query plan. Better use LEFT SEMI JOIN, it implements uncorrelated IN/EXISTS in efficient way.

    If you prefer to move it to the WHERE:

    WHERE a.c_id IN (select c_id from second)
    

    or correlated EXISTS:

    WHERE EXISTS (select 1 from second b where a.c_id=b.c_id)
    

    But as I said, all of them are implemented internally using JOIN operator.