google-compute-engine, hortonworks-data-platform, google-hadoop

Hadoop on Google Compute Engine: how to add external software


I need to set up a Hadoop cluster on Google Compute Engine. While this seems straightforward using either the web console's Click & Deploy or the command-line tool bdutil, my concern is that my jobs require additional dependencies on the machines, for instance Xvfb, Firefox, and others, though all installable via apt-get.

It's not clear to me which is the best way to go. The options that come to mind are:

1) Create a custom image with the additional software and use it to deploy the Hadoop cluster, either via bdutil or Click & Deploy. Would that work?

2) Use a standard image together with bdutil and a custom configuration file (editing an existing one) to perform all the sudo apt-get install xxx steps. Is that a viable option?

Option 1) is basically what I had to do in the past to run Hadoop on AWS, and honestly it's a pain to maintain. I'd be more than happy with Option 2), but I'm not sure bdutil allows that.

Do you see any other way to set up the Hadoop cluster? Any help is appreciated!


Solution

  • bdutil is in fact designed to support custom extensions. You can certainly edit an existing one for an easy way to get started, but the recommended best practice is to create your own "_env.sh" extension, which can be mixed in with other bdutil extensions if necessary. This way you can more easily merge any updates Google makes to core bdutil without worrying about conflicts with your customizations. You only need to create two files, for example:

    File with shell commands:

    # install_my_custom_tools.sh
    
    # Shell commands to install whatever you want
    apt-get -y install xvfb  # Debian package names are lowercase
    
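    Since the question also mentions Firefox, the same script can list any number of packages. A slightly fuller sketch follows, with the caveat that the package names here are assumptions and vary by image (Firefox has shipped as iceweasel or firefox-esr on some Debian releases):

    # install_my_custom_tools.sh

    # Avoid interactive prompts during unattended installs
    export DEBIAN_FRONTEND=noninteractive
    apt-get update
    # adjust 'firefox' to whatever package your image actually provides
    apt-get -y install xvfb firefox
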

    File referencing the commands file which you'll plug into bdutil:

    # my_custom_tools_env.sh
    
    COMMAND_GROUPS+=(
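      # Each group maps a name to one or more shell-script files to run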
      "install_my_custom_tools_group:
         install_my_custom_tools.sh
      "
    )
    
    COMMAND_STEPS+=(
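      # Each step is '<group to run on the master>,<group to run on the workers>'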
      'install_my_custom_tools_group,install_my_custom_tools_group'
    )
    

    Then, when running bdutil you can simply mix it in with the -e flag:

    ./bdutil -e my_custom_tools_env.sh deploy
    

    If you want to organize helper scripts into multiple files, you can easily list more shell scripts within a single COMMAND_GROUP:

    COMMAND_GROUPS+=(
      "install_my_custom_tools_group:
         install_my_custom_tools.sh
         my_fancy_configuration_script.sh
      "
    )
    

    If you want something to run only on the master, simply provide * as the second argument within COMMAND_STEPS:

    COMMAND_GROUPS+=(
      "install_my_custom_tools_group:
         install_my_custom_tools.sh
      "
      "install_on_master_only:
         install_fancy_master_tools.sh
      "
    )
    COMMAND_STEPS+=(
      'install_my_custom_tools_group,install_my_custom_tools_group'
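      # '*' is a placeholder: nothing from this step runs on the workers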
      'install_on_master_only,*'
    )
    

    When using these, you can still easily mix with other env files, for example:

    ./bdutil -e my_custom_tools_env.sh -e extensions/spark/spark_env.sh deploy
    

    For files residing in the same directory as bdutil or under the extensions directory, you can also use a shorthand notation, only specifying the file basename without the _env.sh suffix:

    ./bdutil -e my_custom_tools -e spark deploy
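
    After deployment you can sanity-check that the tools actually landed, for example by SSHing into the master. The instance name below assumes bdutil's default "hadoop" prefix; substitute your own:

    gcloud compute ssh hadoop-m --command 'which Xvfb'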