Why is hive join taking too long?

I am running a code which basically goes like this:

Create table abc as 
select A.* from
table1 A
Left outer join
table2 B 
on 
A.col1=B.col1 and  A.col2=B.col2;

Number of records in table1=7009102 Number of records in table2=1787493

I have similar 6 queries in my script but my script is stuck on the 4th such query. I tried running via tez and mapreduce but both have the same issue.

In mapreduce it is stuck at map 0% nd reduce 0% even after an hour. There are no reducers In Tez, its only 22% in 1 hour.

Upon checking the logs it shows many entries like 'progress of TaskAttempt attempt_12334_m_000003_0 is: 0.0'.

I ran the job in tez, and now its almost 3 hours and the job is about to finish with 2 failed in Map-2 Vertice.

Solution

General tips to improve Hive queries to run faster

1. Use ORC File
Hive supports ORC file – a new table storage format that sports fantastic speed improvements through techniques like predicate pushdown (pushup in Hive), compression and more.
Using ORCFile for every HIVE table should really be a no-brainer, and extremely beneficial to get fast response times for your HIVE queries.

CREATETABLEA_ORC ( 
customerIDint, namestring, age int, address string 
)

2. Use Vectorization Vectorized query execution improves performance of operations like scans, aggregations, filters, and joins, by performing them in batches of 1024 rows at once instead of a single row each time. Introduced in Hive 0.13, this feature significantly improves query execution time, and is easily enabled with two parameters settings:

I. sethive.vectorized.execution.enabled = true;
II. sethive.vectorized.execution.reduce.enabled = true;

3. Partition Based Joins: To optimize joins in Hive, we have to reduce the query scan time. For that, we can create a Hive table with partitions by specifying the partition predicates in the ‘WHERE’ clause or the ON clause in a JOIN.
For Example: The table ‘state view’ is partitioned on the column ‘state.’ The below query retrieves rows for only a given state: Optimizing Joins In Hive

SELECT state_view.* FROM state view WHERE state_view.state= ‘State-1’ AND state_view.state = ‘State-3’;

If a table state view is joined with another table city users, you can specify a range of partitions in the ON clause as follows:

SELECT state_view.* FROM state_view JOIN city_users ON (state_view.state = city_users.state);

Hope this post helped you with all your joins optimization needs in Hive.