I need to set up a Hadoop cluster on Google Compute Engine. While it seems straightforward either using the web console's Click&Deploy or via the command-line tool bdutil, my concern is that my jobs require additional dependencies on the machines, for instance Xvfb, Firefox, and others, all installable via apt-get.
It's not clear to me which is the best way to go. The options that come to mind are:
1) Create a custom image with the additional packages and use it to deploy the Hadoop cluster, either via bdutil or Click&Deploy. Would that work?
2) Use a standard image and bdutil with a custom configuration file (editing an existing one) to perform all the sudo apt-get install xxx commands. Is that a viable option?
Option 1) is basically what I had to do in the past to run Hadoop on AWS, and honestly it's a pain to maintain. I'd be more than happy with Option 2), but I'm not sure bdutil is able to do that.
Do you see any other way to set up the Hadoop cluster? Any help is appreciated!
bdutil in fact is designed to support custom extensions; you can certainly edit an existing one for an easy way to get started, but the recommended best practice is to create your own "_env.sh" extension, which can be mixed in with other bdutil extensions if necessary. That way you can more easily merge any updates Google makes to core bdutil without worrying about conflicts with your customizations. You only need to create two files, for example:
File with shell commands:
# install_my_custom_tools.sh
# Shell commands to install whatever you want
apt-get update                # refresh package lists first
apt-get -y install xvfb       # note: the Debian package name is lowercase 'xvfb'
File referencing the commands file which you'll plug into bdutil:
# my_custom_tools_env.sh
# Each COMMAND_GROUPS entry maps a group name to the script files it runs.
COMMAND_GROUPS+=(
"install_my_custom_tools_group:
install_my_custom_tools.sh
"
)
# Each COMMAND_STEPS entry is 'group_for_master,group_for_workers';
# here the same group runs on both.
COMMAND_STEPS+=(
'install_my_custom_tools_group,install_my_custom_tools_group'
)
Then, when running bdutil you can simply mix it in with the -e flag:
./bdutil -e my_custom_tools_env.sh deploy
If you want to organize helper scripts into multiple files, you can easily list more shell scripts within a single COMMAND_GROUP:
COMMAND_GROUPS+=(
"install_my_custom_tools_group:
install_my_custom_tools.sh
my_fancy_configuration_script.sh
"
)
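A helper script like the hypothetical my_fancy_configuration_script.sh above is just another list of shell commands. As a minimal sketch, assuming you want Xvfb running as a virtual display for headless tools such as Firefox, it might look like:

# my_fancy_configuration_script.sh
# Hypothetical sketch: start Xvfb on display :1 and make DISPLAY available
# to login shells so headless GUI tools can find it.
Xvfb :1 -screen 0 1024x768x16 &
echo 'export DISPLAY=:1' > /etc/profile.d/xvfb_display.sh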
If you want something to run only on the master, simply provide * as the second argument within the COMMAND_STEPS entry:
COMMAND_GROUPS+=(
"install_my_custom_tools_group:
install_my_custom_tools.sh
"
"install_on_master_only:
install_fancy_master_tools.sh
"
)
COMMAND_STEPS+=(
'install_my_custom_tools_group,install_my_custom_tools_group'
# The * in the worker position means nothing runs on the workers for this step
'install_on_master_only,*'
)
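The referenced install_fancy_master_tools.sh is likewise a placeholder name. As a sketch, a master-only script might install tools you only need on the node where you submit jobs, for example:

# install_fancy_master_tools.sh
# Hypothetical sketch: packages only needed on the job-submission node.
apt-get update
apt-get -y install git screen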
When using these, you can still easily mix with other env files, for example:
./bdutil -e my_custom_tools_env.sh -e extensions/spark/spark_env.sh deploy
For files residing in the same directory as bdutil or under the extensions directory, you can also use a shorthand notation, specifying only the file basename without the _env.sh suffix:
./bdutil -e my_custom_tools -e spark deploy
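Putting it together for your case, a minimal sketch of a deploy-and-verify session (the shell subcommand is an assumption; check ./bdutil --help for your version):

# Deploy with the custom extension mixed in, then spot-check the install.
./bdutil -e my_custom_tools deploy

# Assumed: 'bdutil shell' SSHes into the master node.
./bdutil shell
which Xvfb    # should resolve once the xvfb package is installed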