hadoop mapreduce hadoop-yarn azure-hdinsight

Not all nodes are being utilized in cluster

I have a 30-node Hadoop MR2 cluster managed by YARN. There are currently 10 Oozie jobs, each running a single Map program. I'm noticing that only 11 of the 30 nodes are actually being utilized; only 11 nodes have containers running the Map programs.

I would expect each node to have at least one container running. Why is that not the case? Is it due to the input splits, and that based on my HDFS block size setting, the input data was best split across only 11 nodes? If that's the case, would it be better to adjust the block size so all nodes get utilized?

Solution

The ResourceManager allocates resources across the cluster according to the requests it receives. Those resources are handed out as containers, which run your MapReduce tasks.

A data node can host more than one container if enough resources are available. Remember that in Hadoop the computation is moved to the data, not the other way around. The nodes running your map tasks are most likely the ones storing the data you are processing. The input split size, which depends on the block size, determines how many map tasks are created, but it does not directly determine which hosts take part in the computation.
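To make this concrete, here is a rough back-of-the-envelope sketch in Python. It is not Hadoop code. `split_count` mirrors my understanding of the default `FileInputFormat` split logic (split size is `max(minSize, min(maxSize, blockSize))`, with a 10% slop on the last split), and `containers_per_node` is a simplified capacity estimate that assumes the scheduler considers both memory and vcores. All function names and the numbers in the usage lines are hypothetical and are not taken from your cluster.

```python
import math


def split_count(file_bytes, block_bytes, min_split=1, max_split=2**63 - 1, slop=1.1):
    """Approximate the number of input splits (one map task each) for one file.

    Assumption: follows the default FileInputFormat behaviour, where a final
    chunk up to 10% larger than the split size is not split further.
    """
    split_size = max(min_split, min(max_split, block_bytes))
    splits = 0
    remaining = file_bytes
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # whatever is left becomes the last split
    return splits


def containers_per_node(node_mb, node_vcores, container_mb, container_vcores=1):
    """Simplified estimate of how many containers fit on one NodeManager.

    Assumption: both memory and vcores limit the count. With a memory-only
    resource calculator, only the memory term would apply.
    """
    return min(node_mb // container_mb, node_vcores // container_vcores)


def min_nodes_needed(total_containers, per_node):
    """Lower bound on the nodes required if containers are packed tightly."""
    return math.ceil(total_containers / per_node)


MIB = 1024 * 1024

# Hypothetical figures: a 1 GiB input with 128 MiB blocks, and nodes that
# offer 8192 MB / 4 vcores to YARN with 2048 MB map containers.
maps_per_job = split_count(1024 * MIB, 128 * MIB)
per_node = containers_per_node(8192, 4, 2048)
nodes = min_nodes_needed(10 * maps_per_job, per_node)
print(maps_per_job, per_node, nodes)
```

With these made-up figures, the whole workload fits on a fraction of a 30-node cluster, so idle nodes are not by themselves a sign of a problem. The scheduler is not obliged to pack containers this tightly, so treat the node count as a lower bound rather than a prediction.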

It's a bad idea to assume that every node should be busy. With big data, the best approach is to move as little data as possible.