hadoop amazon-web-services apache-spark emr amazon-emr

How does AWS EMR resize a cluster?

A question came up while I was using AWS EMR today.

EMR provides a very simple way to resize a cluster; adding or removing nodes is easy.

In Apache Hadoop, we can edit the slaves file to add or remove nodes. But the slaves file on EMR contains just localhost, and I can't find any other configuration that indicates where the slaves are.

How does EMR add or remove nodes from the cluster without restarting any process on the master node?

Solution

The masters and slaves files are only used by shell scripts such as start-all.sh and start-dfs.sh. Nothing else in Hadoop reads them. From the cluster's perspective, these files do not define where the namenode, secondary namenode, or worker nodes run, and EMR does not use those shell scripts to start the cluster. Instead, the property fs.defaultFS (or its deprecated alias fs.default.name) in core-site.xml defines the namenode host. Every datanode that starts with this configuration reports to that namenode and is added to the cluster. Similarly, the resourcemanager host is defined in the yarn-site.xml on every node.
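To make the role of core-site.xml concrete, here is a minimal sketch that reads the namenode address from a configuration document. The XML content and the host name in it are made up for illustration, and the helper names `hadoop_property` and `namenode_uri` are hypothetical; on a real node you would read the actual core-site.xml instead of the embedded string.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml content; the host name is invented for this example.
CORE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ip-10-0-0-1.ec2.internal:8020</value>
  </property>
</configuration>
"""


def hadoop_property(xml_text, name):
    """Return the value of a Hadoop property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def namenode_uri(xml_text):
    # fs.defaultFS is the current key; fs.default.name is its deprecated alias.
    return (hadoop_property(xml_text, "fs.defaultFS")
            or hadoop_property(xml_text, "fs.default.name"))


print(namenode_uri(CORE_SITE_XML))
```

A datanode needs nothing more than this address to find the namenode, which is why no list of workers has to exist on the master.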

No process in the cluster needs a restart when nodes are added. Once a datanode is up, it reports to the namenode and starts contributing to HDFS. Similarly, once a nodemanager is up, it reports to the cluster's resourcemanager and starts contributing to the processing layer.
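The registration flow can be pictured with a toy model. This is not Hadoop code: `ToyMaster`, its methods, and the 30-second timeout are all invented for illustration. It only shows the idea that the master learns about workers from their reports, so a new worker joins without any restart and a silent worker drops out.

```python
class ToyMaster:
    """Toy stand-in for a namenode or resourcemanager (not Hadoop code)."""

    def __init__(self):
        self.workers = {}  # worker id -> time of the last report

    def heartbeat(self, worker_id, now):
        # The first report from an unknown worker registers it.
        is_new = worker_id not in self.workers
        self.workers[worker_id] = now
        return is_new

    def live_workers(self, now, timeout=30):
        # Workers that have stopped reporting are treated as gone.
        return sorted(w for w, seen in self.workers.items()
                      if now - seen <= timeout)


master = ToyMaster()
master.heartbeat("core-1", now=0)
master.heartbeat("task-1", now=5)   # a node added by a resize
master.heartbeat("core-1", now=40)  # task-1 has stopped reporting by now
print(master.live_workers(now=40))  # ['core-1']
```

The master never consults a static list of workers; its view of the cluster is built entirely from the reports it receives.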

EMR has three types of nodes:

Master node
Core node
Task node

An EMR cluster has a single master node by default (newer EMR releases also offer a multi-master option). The master node runs the namenode and the other master services, such as the ResourceManager and the HBase Master.

A core node provides both storage and processing, which means it runs a datanode and a nodemanager. We can always increase the number of core nodes, but decreasing it is risky: core nodes hold HDFS blocks, so removing them carelessly can result in data loss.

Task nodes provide processing capability only. They run a nodemanager but no datanode, which makes them suited to serving transient loads. Because they store no HDFS data, we can freely increase or decrease the number of task nodes.
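As an illustration of resizing from code, here is a minimal sketch that changes the size of a task instance group through the EMR API. It assumes the cluster uses instance groups (not instance fleets) and that `emr` behaves like `boto3.client("emr")`; the helper name `resize_task_group` is hypothetical, and it only looks at the first page of instance groups.

```python
def resize_task_group(emr, cluster_id, new_count):
    """Set the instance count of the cluster's first TASK instance group.

    `emr` is expected to behave like boto3.client("emr").
    Returns the id of the instance group that was modified.
    """
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    task_groups = [g for g in groups if g["InstanceGroupType"] == "TASK"]
    if not task_groups:
        raise ValueError("cluster has no task instance group")
    group_id = task_groups[0]["Id"]
    # EMR starts or decommissions the nodes; nothing is restarted on the master.
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{"InstanceGroupId": group_id,
                         "InstanceCount": new_count}],
    )
    return group_id


# Usage (needs boto3 and AWS credentials; the cluster id is a placeholder):
#   import boto3
#   resize_task_group(boto3.client("emr"), "j-XXXXXXXXXXXXX", 4)
```

Passing the client in as a parameter keeps the helper easy to try against a stub before pointing it at a real cluster.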

Resizing does not disturb the existing cluster. EMR never invokes scripts like start-all.sh or stop-all.sh; it starts the individual services on each node to bring up the cluster, so the entries in the masters and slaves files are never consulted.