
Validate Hive Single and Multi Query Parallelism


I configured Hive parallelism with the hive-site.xml properties below and restarted the cluster:

Property 1

Name: hive.exec.parallel
Value: true
Description: Run hive jobs in parallel

Property 2

Name: hive.exec.parallel.thread.number
Value: 8 (default)
Description: Maximum number of hive jobs to run in parallel

To test parallelism, I set up the two scenarios below:

1. A single query in file.hql, run as hive -f file.hql

SELECT COL1, COL2 FROM TABLE1
UNION ALL
SELECT COL3, COL4 FROM TABLE2

Result:

When hive.exec.parallel = true, Time taken: 28.015 seconds, Total MapReduce CPU Time Spent: 3 seconds 10 msec

When hive.exec.parallel = false, Time taken: 24.778 seconds, Total MapReduce CPU Time Spent: 3 seconds 90 msec

2. Two independent queries in two different files, run as nohup hive -f file1.hql & nohup hive -f file2.hql

select count(1) from t1 -> file1.hql
select count(1) from t2 -> file2.hql

Result:

When hive.exec.parallel = false, Time taken: 29.391 seconds, Total MapReduce CPU Time Spent: 1 seconds 890 msec

Questions:

How do I check that the two scenarios above are indeed running in parallel? On the console, the output looks as if the queries ran sequentially.

Why is the time taken longer when hive.exec.parallel = true? How can I see that multiple Hive stages are being used?

Thank you,


Solution

  • When the Hive execution engine is MR (hive.execution.engine=mr), Hive represents a query as one or more MapReduce jobs, and these jobs (each containing map and reduce phases) can be executed in parallel where possible. For example, this query:

    SELECT COL1, COL2 FROM TABLE1
    UNION
    SELECT COL3, COL4 FROM TABLE2
    

    can be executed as 3 jobs: (1) select from TABLE1, (2) select from TABLE2, (3) UNION (distinct).

    The first two jobs can be executed in parallel, and the third runs after the first and second have completed.
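
    One way to see which stages Hive plans for a query is EXPLAIN, whose output contains a STAGE DEPENDENCIES section. The sketch below prints only that section; stages listed as root stages have no prerequisites, so they are the candidates for running concurrently. The function name show_stage_dependencies is made up for illustration, and the hive CLI is assumed to be on the PATH.

    ```shell
    # Print only the STAGE DEPENDENCIES section of the EXPLAIN output for a query.
    # $1 is the query text (without a trailing semicolon).
    show_stage_dependencies() {
      hive -e "EXPLAIN $1" 2>/dev/null |
        awk '/^STAGE DEPENDENCIES:/ { show = 1 }
             /^STAGE PLANS:/        { show = 0 }
             show'
    }

    # Example call, using the query from above:
    #   show_stage_dependencies "SELECT COL1, COL2 FROM TABLE1 UNION SELECT COL3, COL4 FROM TABLE2"
    ```

    The exact stage numbering depends on the Hive version and the optimizer, so the output should be read as a plan, not as proof of what ran concurrently.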

    A more complex query can be executed as many MR jobs, and the parameters hive.exec.parallel and hive.exec.parallel.thread.number allow the jobs of a single query running on MR to execute in parallel.

    You can check the jobs in the Job Tracker; its URL is printed in the logs during execution. The logs also show which jobs have been started and their execution progress.
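
    A rough check for overlap can also be made from a saved console log. The sketch below assumes the MR driver prints "Starting Job = ..." and "Ended Job = ..." lines, as Hive's MapReduce console output typically does, and counts how many jobs were open at the same time. The function name max_concurrent_jobs and the file name run.log are hypothetical.

    ```shell
    # Print the largest number of jobs that were started but not yet ended,
    # judged only by the order of "Starting Job" / "Ended Job" lines in the log.
    # $1 is the path to a saved Hive console log.
    max_concurrent_jobs() {
      awk '/Starting Job = / { running++; if (running > max) max = running }
           /Ended Job = /    { if (running > 0) running-- }
           END { print max + 0 }' "$1"
    }

    # Example (run.log is a hypothetical file name):
    #   hive -f file.hql 2>&1 | tee run.log
    #   max_concurrent_jobs run.log
    ```

    A result greater than 1 suggests that jobs overlapped; a result of 1 suggests they ran one after another. The Job Tracker remains the more reliable source, since log line order only approximates actual start and end times.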

    When running on the Tez execution engine (hive.execution.engine=tez), Hive represents a query as a single optimized DAG, omitting unnecessary steps such as writing intermediate results to persistent storage and reading them back with a mapper. All vertices in the DAG that can be executed in parallel are executed in parallel. The hive.exec.parallel settings have no effect on Tez, which always runs in parallel. The same query will be represented as 2 mapper vertices (running in parallel) and a reducer running at the end. The final reducer can also start early, when the mappers are almost complete.

    The settings hive.exec.parallel and hive.exec.parallel.thread.number do not affect the parallelism of a query on Tez, nor do they apply to two separate queries in a single script.
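
    Before comparing timings, it may help to confirm which engine the session actually uses. The sketch below relies on the CLI printing name=value for a SET statement without a value; the function name parallel_setting_applies and its messages are made up for illustration.

    ```shell
    # Report whether hive.exec.parallel is relevant for the session's engine.
    parallel_setting_applies() {
      psa_engine=$(hive -e "SET hive.execution.engine;" 2>/dev/null |
        sed -n 's/^hive\.execution\.engine=//p' |
        tr '[:upper:]' '[:lower:]')
      case "$psa_engine" in
        mr)  echo "engine=mr: hive.exec.parallel applies to multi-job queries" ;;
        tez) echo "engine=tez: hive.exec.parallel does not affect query parallelism" ;;
        *)   echo "engine=$psa_engine: not covered by this check" ;;
      esac
    }

    # Example:
    #   parallel_setting_applies
    ```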

    Two separate queries in a single script run one after the other, not in parallel (each with its own task parallelism).

    Two Hive sessions, as in your last example, run in parallel (depending on the cluster resources available).

    The difference in time can be measured with the Unix time command. The time reported by Hive is cluster time. If the cluster has no resources available, parallel tasks may wait for resources. Use the Job Tracker to check exactly what happens during execution.
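
    For a single script, time hive -f file.hql gives the wall-clock time. For two sessions started together, something like the following sketch could be used: it launches each command in the background, waits for all of them, and prints the total elapsed wall-clock seconds. The helper name run_parallel and the log file names in the example are hypothetical.

    ```shell
    # Run every argument as a separate background command, wait for all of them,
    # then print the elapsed wall-clock time in whole seconds.
    # Returns non-zero if any of the commands failed.
    run_parallel() {
      rp_start=$(date +%s)
      rp_pids=""
      for rp_cmd in "$@"; do
        sh -c "$rp_cmd" &
        rp_pids="$rp_pids $!"
      done
      rp_status=0
      for rp_pid in $rp_pids; do
        wait "$rp_pid" || rp_status=1
      done
      rp_end=$(date +%s)
      echo "elapsed_seconds=$((rp_end - rp_start))"
      return "$rp_status"
    }

    # Example (log file names are hypothetical):
    #   run_parallel "hive -f file1.hql > file1.log 2>&1" "hive -f file2.hql > file2.log 2>&1"
    ```

    If the elapsed time is close to the longer of the two individual runs rather than their sum, the sessions overlapped. On a busy cluster the sessions may still queue for resources, so the Job Tracker should be checked alongside the timing.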

    So there are actually different kinds of parallelism:

    Parallelism of a single query's jobs on MR: the parameters you are asking about control this kind.

    Hive sessions running in parallel: these parameters do not affect it.

    Tez vertex parallelism: these parameters do not affect it.

    Parallel execution of instances of the same vertex (a mapper or reducer can each be started more than once): these parameters do not affect it.