I am running a code which basically goes like this:
Create table abc as
select A.* from
table1 A
Left outer join
table2 B
on
A.col1=B.col1 and A.col2=B.col2;
Number of records in table1=7009102 Number of records in table2=1787493
I have similar 6 queries in my script but my script is stuck on the 4th such query. I tried running via tez and mapreduce but both have the same issue.
In mapreduce it is stuck at map 0% nd reduce 0% even after an hour. There are no reducers In Tez, its only 22% in 1 hour.
Upon checking the logs it shows many entries like 'progress of TaskAttempt attempt_12334_m_000003_0 is: 0.0'.
I ran the job in tez, and now its almost 3 hours and the job is about to finish with 2 failed in Map-2 Vertice.
General tips to improve Hive queries to run faster
1. Use ORC File
Hive supports ORC file – a new table storage format that sports fantastic speed improvements through techniques like predicate pushdown (pushup in Hive), compression and more.
Using ORCFile for every HIVE table should really be a no-brainer, and extremely beneficial to get fast response times for your HIVE queries.
CREATETABLEA_ORC (
customerIDint, namestring, age int, address string
)
2. Use Vectorization Vectorized query execution improves performance of operations like scans, aggregations, filters, and joins, by performing them in batches of 1024 rows at once instead of a single row each time. Introduced in Hive 0.13, this feature significantly improves query execution time, and is easily enabled with two parameters settings:
I. sethive.vectorized.execution.enabled = true;
II. sethive.vectorized.execution.reduce.enabled = true;
3. Partition Based Joins:
To optimize joins in Hive, we have to reduce the query scan time. For that, we can create a Hive table with partitions by specifying the partition predicates in the ‘WHERE’ clause or the ON clause in a JOIN.
For Example: The table ‘state view’ is partitioned on the column ‘state.’
The below query retrieves rows for only a given state:
Optimizing Joins In Hive
SELECT state_view.* FROM state view WHERE state_view.state= ‘State-1’ AND state_view.state = ‘State-3’;
If a table state view is joined with another table city users, you can specify a range of partitions in the ON clause as follows:
SELECT state_view.* FROM state_view JOIN city_users ON (state_view.state = city_users.state);
Hope this post helped you with all your joins optimization needs in Hive.