Search code examples
hadoopjoinhivehiveqlazure-hdinsight

Hive: Inner Join query executing forever due to last Reducer job


Using Hive 1.2.1000.2 on Azure HDInsight 3.6 performing an INNER JOIN to get the count of records that are present both in Table_1 and Table_2.

Details of the tables:

Table_1: 310M records

Sample data:

master_id        modelkey     order_id  
---------------------------------------
mi0000bd1444     4874         d988e53cd
mi000097d5       44365        p0905gd44
mi0000d2ab09ea   309141         
mi0001d6a        8705         7574  
mi00011f7c085    4063         d165804b2
mi0001a57db      314          9c84ft879 

Table_2: 35M records

Sample data:

order_id    vendor_id
---------------------------------------
81d162f23   7122a0c
6988e53cd   517ba6e
5165804b2   5c5e161
47ba91ea3   7686b2d
f45cab9de   35be1af

Below are the details of what I've tried so far.

Hive query:

SELECT COUNT(*) 
FROM db.table_1 t1
INNER JOIN db.table_2 t2 ON t1.order_id = t2.order_id;

Hive properties:

SET hive.tez.container.size=10240;
SET tez.am.resource.memory.mb=10240;
SET tez.task.resource.memory.mb=10240;
SET hive.execution.engine=tez;
SET hive.exec.compress.output=true;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;

The query is executing for >7 hours and getting stuck at the last Reducer job .e.

--------------------------------------------------------------------------------
    VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Map 4 ..........   SUCCEEDED    715        715        0        0       0       0
Reducer 2 .....      RUNNING    189        188        1        0       0       0
Reducer 3            RUNNING      1          0        1        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/04  [=========================>>-] 99%   ELAPSED TIME: 25307.97 s  
--------------------------------------------------------------------------------

Is there a way to overcome the last Reducer's issue and, get the result?

Explain for:

  1. table_1
  2. table_2
  3. table1_table2_join

Solution

  • Performed the following steps, it helped! and hope it help others:

    1. Removed the records which had no value i.e. order_id=''
    2. Performed the JOIN in batches rather than doing all in one go
    3. Referred the below for setting certain hive properties:

    hive properties