Search code examples
hadoophivequery-optimizationhiveqlapache-tez

Hive Tez reducers are running super slow


I have joined multiple tables and the total no of rows are around 25 billion. On top of that, I am doing aggregation. Here are my hive settings as below, which I am using to generate the final output. I am not really sure how to tune the query and make it run faster. Currently, I am doing trial and error and see if that can produce some results but that doesn't seem to be working.Mappers are running faster but reducers are taking forever to finish off. Could anyone share your thoughts on this? Thank you.

    SET hive.execution.engine=tez;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.qubole.cleanup.partial.data.on.failure=true;
    SET hive.tez.container.size=8192;
    SET tez.task.resource.memory.mb=8192;
    SET tez.task.resource.cpu.vcores=2;
    SET hive.mapred.mode=nonstrict;
    SET hive.qubole.dynpart.use.prefix=true;
    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled =true;
    SET hive.cbo.enable=true;
    SET hive.compute.query.using.stats=true;
    SET hive.stats.fetch.column.stats=true;
    SET hive.stats.fetch.partition.stats=true;
    SET mapred.reduce.tasks = -1;
    SET hive.auto.convert.join.noconditionaltask.size=2730;
    SET hive.auto.convert.join=true;
    SET hive.auto.convert.join.noconditionaltask=true;
    SET hive.auto.convert.join.noconditionaltask.size=8053063680;
    SET hive.compute.query.using.stats=true;
    SET hive.stats.fetch.column.stats=true;
    SET hive.stats.fetch.partition.stats=true;
    SET mapreduce.job.reduce.slowstart.completedmaps=0.8;
    set hive.tez.auto.reducer.parallelism = true;
    set hive.exec.reducers.max=100;
    set hive.exec.reducers.bytes.per.reducer=1024000000;

SQL:

SELECT D.d
      ,D.b
      ,COUNT(DISTINCT A.x)  AS cnt
      ,SUM(c)               AS sum
 FROM A
LEFT JOIN
       B
ON A.a = B.b
LEFT JOIN
       C 
ON B.b = C.c
JOIN
       D
 ON A.a >= D.d
AND A.a <= D.d
GROUP BY 1,2
CLUSTER BY D.d;

Solution

  • Do not have query plan yet, so maybe there is something else, but these settings definitely are limiting reducers parallelism:

    set hive.exec.reducers.max=100;
    set hive.exec.reducers.bytes.per.reducer=1024000000;
    

    I'd suggest to increase the number of reducers allowed and reduce bytes per reducer, this will increase parallelism on reducers:

    set hive.exec.reducers.max=5000; 
    set hive.exec.reducers.bytes.per.reducer=67108864;
    

    Also Hive 1.2.0+ provides auto-rewrite optimization for count(distinct). Check this setting, it should be true by default:

    hive.optimize.distinct.rewrite=true;
    

    And if the query stuck on the last reducer, then there is a skew in join keys