There is Hive 2.1.1 over MR, table test_table
stored as sequencefile and the following ad-hoc query:
select t.*
from test_table t
where t.test_column = 100
Although this query can be executed without starting MR (fetch task), sometimes it takes longer to scan HDFS files rather than triggering a single map job.
When I want to enforce MR execution, I make the query more complex: e.g., using distinct
. The significant drawbacks of this approach are:
Is there a recommended way to force MR execution when using Hive-on-MR?
The hive executor decides either to execute map task or fetch task depending on the following settings (with defaults):
select count(*) from src
also can be executed in a fetch taskIt prompts me the following two options:
For some reason lowering the threshold did not change anything in my case, so I stood with the second option: seems fine for ad-hoc queries.
More details regarding these settings can be found in Cloudera forum and Hive wiki.