Search code examples
hadoophivemapreducehadoop-yarntez

Why is hive join taking too long?


I am running a code which basically goes like this:

Create table abc as 
select A.* from
table1 A
Left outer join
table2 B 
on 
A.col1=B.col1 and  A.col2=B.col2;

Number of records in table1=7009102 Number of records in table2=1787493

I have similar 6 queries in my script but my script is stuck on the 4th such query. I tried running via tez and mapreduce but both have the same issue.

In mapreduce it is stuck at map 0% nd reduce 0% even after an hour. There are no reducers In Tez, its only 22% in 1 hour.

Upon checking the logs it shows many entries like 'progress of TaskAttempt attempt_12334_m_000003_0 is: 0.0'.

I ran the job in tez, and now its almost 3 hours and the job is about to finish with 2 failed in Map-2 Vertice.


Solution

  • General tips to improve Hive queries to run faster

    1. Use ORC File
    Hive supports ORC file – a new table storage format that sports fantastic speed improvements through techniques like predicate pushdown (pushup in Hive), compression and more.
    Using ORCFile for every HIVE table should really be a no-brainer, and extremely beneficial to get fast response times for your HIVE queries.

    CREATETABLEA_ORC ( 
    customerIDint, namestring, age int, address string 
    )
    

    2. Use Vectorization Vectorized query execution improves performance of operations like scans, aggregations, filters, and joins, by performing them in batches of 1024 rows at once instead of a single row each time. Introduced in Hive 0.13, this feature significantly improves query execution time, and is easily enabled with two parameters settings:

    I. sethive.vectorized.execution.enabled = true;
    II. sethive.vectorized.execution.reduce.enabled = true;
    

    3. Partition Based Joins: To optimize joins in Hive, we have to reduce the query scan time. For that, we can create a Hive table with partitions by specifying the partition predicates in the ‘WHERE’ clause or the ON clause in a JOIN.
    For Example: The table ‘state view’ is partitioned on the column ‘state.’ The below query retrieves rows for only a given state: Optimizing Joins In Hive

    SELECT state_view.* FROM state view WHERE state_view.state= ‘State-1’ AND state_view.state = ‘State-3’;
    

    If a table state view is joined with another table city users, you can specify a range of partitions in the ON clause as follows:

    SELECT state_view.* FROM state_view JOIN city_users ON (state_view.state = city_users.state); 
    

    Hope this post helped you with all your joins optimization needs in Hive.