hadoop · amazon-ec2 · configuration-management

Setting up a multi-node Hadoop cluster automatically


I have an EC2 image that I made with Hadoop installed. However, I set it up to be roleless upon instantiation (it isn't a slave or a master). To start a Hadoop cluster, I launch as many instances (nodes) as I need on EC2, and then I have to do the following three things on each node:

  1. Update /etc/hosts to contain the necessary IP addresses.
  2. If it is the master node, update $HADOOP_HOME/conf/masters and $HADOOP_HOME/conf/slaves.
  3. Enable SSH access between the nodes.

I'd like to find a way to do this automatically, so that for an arbitrary number of nodes I don't have to go in and apply all of these settings on each one.
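
For concreteness, here is a rough sketch of the kind of per-node script I imagine (I'm not sure this is the right approach; the IP addresses, hostnames, and paths below are placeholders, and I'd still need to get the real IPs from somewhere):

```python
#!/usr/bin/env python
# Rough sketch only: hostnames, paths, and the source of the IP list are placeholders.
import os

# In practice these would have to come from the EC2 API or be passed in somehow.
master_ip = "10.0.0.1"
slave_ips = ["10.0.0.2", "10.0.0.3"]

hadoop_conf = os.path.join(os.environ.get("HADOOP_HOME", "/usr/local/hadoop"), "conf")

# Step 1: add hostname entries to /etc/hosts (needs root).
with open("/etc/hosts", "a") as hosts:
    hosts.write("%s master\n" % master_ip)
    for i, ip in enumerate(slave_ips):
        hosts.write("%s slave%d\n" % (ip, i + 1))

# Step 2: on the master only, list the master and slave hostnames
# in $HADOOP_HOME/conf/masters and $HADOOP_HOME/conf/slaves.
with open(os.path.join(hadoop_conf, "masters"), "w") as f:
    f.write("master\n")
with open(os.path.join(hadoop_conf, "slaves"), "w") as f:
    for i in range(len(slave_ips)):
        f.write("slave%d\n" % (i + 1))

# Step 3 (SSH access between nodes) would still have to happen separately,
# e.g. by baking a shared key into the image or distributing authorized_keys.
```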

How do other people deal with setting up Hadoop clusters automatically? Is there a way to automate the networking part?

I'm not sure it's even possible, since the IP addresses will be different every time, but I'd like to know what other people have tried or what is commonly used. Is there a good way to automate this so that every time I set up a cluster for testing I don't have to repeat these steps on every node? I don't know much about Linux scripting; is this possible with a script, or will I just have to configure every node manually?


Solution

  • I have no experience with Hadoop, but in general the task you are describing is called "configuration management". You write "recipes" and define "roles" (master, slave) for your servers. A role may contain config files for services, packages to install, hostname changes, SSH keys, etc. After the servers have initially started up, you tell each one which role it should take, and it will configure itself automatically.
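
    As a rough illustration of the role idea in plain Python (this is not real Puppet or Salt syntax; the package name and file paths below are just guesses based on your question), a role-based bootstrap boils down to something like this:

    ```python
    #!/usr/bin/env python
    # Toy illustration of "roles": real tools like Puppet or Salt express this
    # declaratively and handle ordering, templating, and idempotency for you.
    import subprocess
    import sys

    def common_setup():
        # Things every node needs regardless of role (package name is an assumption).
        subprocess.check_call(["apt-get", "install", "-y", "openjdk-6-jdk"])

    def master_setup():
        common_setup()
        # Master-specific config, e.g. writing conf/masters (path is a guess).
        with open("/usr/local/hadoop/conf/masters", "w") as f:
            f.write("master\n")

    def slave_setup():
        common_setup()
        # Slave-specific steps, e.g. pointing the node at the master,
        # installing its SSH key, and so on.
        pass

    ROLES = {"master": master_setup, "slave": slave_setup}

    if __name__ == "__main__":
        role = sys.argv[1]   # e.g. "./bootstrap.py master" on the master instance
        ROLES[role]()        # apply the "recipe" for that role
    ```

    With a real configuration management tool you would express the same thing as declarative recipes/states, and the tool takes care of applying them to newly started instances.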

    There are different tools available for these tasks; examples are Puppet and Salt. A comparison of configuration management software is available on Wikipedia.