amazon-web-services, emr, impala

Must impalad be running on a datanode?


A little background:

I've gotten Impala 2.2 running on Amazon EMR 4.1 (which in itself was a huge headache) - with 1 master node, 3 core nodes and 3 task nodes.

Our understanding, after talking with AWS solutions architects, was that we could have a long-running "core cluster", with the master and core nodes comprising the persistent HDFS storage. We would then be able to add an appropriate number of task nodes on demand, which would quickly work through the jobs we submitted before being shut down again.

The Issue:

The issue we're seeing is that the task nodes are not participating in most queries, such as those involving COMPUTE STATS.

Is this an Impala behavior or an Impala on EMR behavior?

Impala has the concept of remote reads, so is there a way to loosen the criteria to include non-datanodes in the processing?


Solution

  • Impala does expect to run on datanodes; this is critical to the performance it gains from reading HDFS data locally on each node.
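
To illustrate why task nodes that are not datanodes can sit idle, here is a toy locality-first assignment in Python. It is not Impala's scheduler code, only a simplified sketch of the idea: each block goes to an impalad that holds a local replica, and a remote read is used only when no impalad host holds one. The host names (core1, task1, ...), the block names, and the function `assign_scan_ranges` are all hypothetical.

```python
def assign_scan_ranges(block_replicas, impalad_hosts):
    """Toy locality-first assignment (an illustration, not Impala's scheduler).

    block_replicas: dict mapping block id -> list of hosts holding a replica.
    impalad_hosts:  list of hosts running an impalad.
    Returns (assignment, load): block id -> chosen host, and host -> block count.
    """
    load = {host: 0 for host in impalad_hosts}
    assignment = {}
    for block, replicas in block_replicas.items():
        # Prefer impalads that hold a local replica of this block.
        local = [h for h in replicas if h in load]
        # Fall back to a remote read only if no impalad is local to the block.
        candidates = local if local else list(impalad_hosts)
        # Among the candidates, pick the least-loaded host.
        chosen = min(candidates, key=lambda h: load[h])
        assignment[block] = chosen
        load[chosen] += 1
    return assignment, load


# Hypothetical cluster: core nodes are datanodes, task nodes hold no HDFS data.
core_nodes = ["core1", "core2", "core3"]
task_nodes = ["task1", "task2", "task3"]

# Every block has its replicas on core nodes only.
block_replicas = {
    "blk_1": ["core1", "core2", "core3"],
    "blk_2": ["core2", "core3", "core1"],
    "blk_3": ["core3", "core1", "core2"],
    "blk_4": ["core1", "core3", "core2"],
    "blk_5": ["core2", "core1", "core3"],
    "blk_6": ["core3", "core2", "core1"],
}

assignment, load = assign_scan_ranges(block_replicas, core_nodes + task_nodes)
print(load)
```

In this model the task nodes never receive a block, because every block already has a local candidate among the core nodes. That matches the behavior described in the question: scan-heavy work stays on the nodes that hold the data.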