I'm getting started with running a Spark cluster on Google Compute Engine, backed by Google Cloud Storage and deployed with bdutil (from the GoogleCloudPlatform GitHub). I am deploying it as follows:
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket deploy
I expect to start with a cluster of 2 nodes (the default) and later to add another worker node to cope with a big job that needs to be run. I would like to do this without completely destroying and re-deploying the cluster, if possible.
I have tried re-deploying with the same command but a different number of nodes, and also running a "create" followed by "run_command_group install_connectors", as below, but each of these fails with errors about the already-existing nodes, e.g.:
./bdutil -n 3 -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket deploy
or
./bdutil -n 3 -b myhdfsbucket create
./bdutil -n 3 -t workers -b myhdfsbucket run_command_group install_connectors
I've also tried snapshotting one of the running workers and cloning it, but not all the services seem to start correctly on the clone, and I'm a bit out of my depth there.
Any guidance as to how I could/should add and/or remove nodes from an already existing cluster?
Update: we've added resize_env.sh to the base bdutil repo, so you no longer need to go to my fork for it.
Original answer:
There isn't official support for resizing a bdutil-deployed cluster just yet, but it's certainly something we've discussed before, and it's in fact fairly doable to put together some basic support for resizing. This may take a different form once merged into the master branch, but I've pushed a first draft of resize support to my fork of bdutil. It was implemented across two commits: one to allow skipping all "master" operations (including create, run_command, delete, etc.) and another to add the resize_env.sh file.
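As a minimal sketch of the one edit the steps below rely on (it assumes NEW_NUM_WORKERS appears as a plain shell assignment inside resize_env.sh, as described in the comments further down), targeting a 5-worker cluster would just be:
# Set the post-resize total worker count in resize_env.sh (it defaults to 5);
# this assumes the variable is a plain assignment like NEW_NUM_WORKERS=5.
sed -i 's/^NEW_NUM_WORKERS=.*/NEW_NUM_WORKERS=5/' resize_env.sh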
I haven't tested it against all combinations of other bdutil extensions, but I've at least successfully run it with the base bdutil_env.sh and with extensions/spark/spark_env.sh. In theory it should work fine with your bigquery and datastore extensions as well. To use it in your case:
# Assuming you initially deployed with this command (default n == 2)
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 deploy
# Before this step, edit resize_env.sh and set NEW_NUM_WORKERS to what you want.
# Currently it defaults to 5.
# Deploy only the new workers, e.g. {hadoop-w-2, hadoop-w-3, hadoop-w-4}:
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 -e resize_env.sh deploy
# Explicitly start the Hadoop daemons on just the new workers:
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 -e resize_env.sh run_command -t workers -- "service hadoop-hdfs-datanode start && service hadoop-mapreduce-tasktracker start"
# If using Spark as well, explicitly start the Spark daemons on the new workers:
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 -e resize_env.sh run_command -t workers -u extensions/spark/start_single_spark_worker.sh -- "./start_single_spark_worker.sh"
# From now on, it's as if you originally turned up your cluster with "-n 5".
# When deleting, remember to include those extra workers:
./bdutil -b myhdfsbucket -n 5 delete
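Optionally, you can sanity-check that the new workers came up and registered. This isn't part of the original steps, and it assumes the default hadoop-w-N worker and hadoop-m master instance names plus the Hadoop 1.x daemons started above:
# Optional sanity check (not part of the steps above); assumes the default
# hadoop-w-N worker and hadoop-m master instance names.
gcloud compute instances list | grep hadoop-w-
# On the master, confirm the new datanodes registered with HDFS
# (Hadoop 1.x command, matching the hadoop-hdfs-datanode service above):
gcloud compute ssh hadoop-m --command "hadoop dfsadmin -report | grep 'Datanodes available'"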
In general, the best-practice recommendation is to condense your configuration into a file instead of always passing flags. For example, in your case you might want a file called my_base_env.sh:
import_env bigquery_env.sh
import_env datastore_env.sh
import_env extensions/spark/spark_env.sh
NUM_WORKERS=2
CONFIGBUCKET=myhdfsbucket
Then the resize commands are much shorter:
# Assuming you initially deployed with this command (default n == 2)
./bdutil -e my_base_env.sh deploy
# Before this step, edit resize_env.sh and set NEW_NUM_WORKERS to what you want.
# Currently it defaults to 5.
# Deploy only the new workers, e.g. {hadoop-w-2, hadoop-w-3, hadoop-w-4}:
./bdutil -e my_base_env.sh -e resize_env.sh deploy
# Explicitly start the Hadoop daemons on just the new workers:
./bdutil -e my_base_env.sh -e resize_env.sh run_command -t workers -- "service hadoop-hdfs-datanode start && service hadoop-mapreduce-tasktracker start"
# If using Spark as well, explicitly start the Spark daemons on the new workers:
./bdutil -e my_base_env.sh -e resize_env.sh run_command -t workers -u extensions/spark/start_single_spark_worker.sh -- "./start_single_spark_worker.sh"
# From now on, it's as if you originally turned up your cluster with "-n 5".
# When deleting, remember to include those extra workers:
./bdutil -b myhdfsbucket -n 5 delete
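If you want to spot-check one of the new workers directly (again, not part of the original steps; hadoop-w-4 is just one of the example worker names from above), the daemon processes should show up in ps by their Java main classes:
# Optional spot-check of a newly added worker (hadoop-w-4 is an example name).
# The Hadoop 1.x datanode/tasktracker and the Spark standalone worker appear
# in ps by their main classes.
gcloud compute ssh hadoop-w-4 --command "ps aux | grep -E 'DataNode|TaskTracker|spark.deploy.worker' | grep -v grep"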
Finally, this isn't quite 100% the same as if you'd deployed the cluster with -n 5 initially: the files /home/hadoop/hadoop-install/conf/slaves and /home/hadoop/spark-install/conf/slaves on your master node will be missing your new nodes. If you ever plan to use /home/hadoop/hadoop-install/bin/[stop|start]-all.sh or /home/hadoop/spark-install/sbin/[stop|start]-all.sh, you can manually SSH into your master node and edit those files to add the new nodes to the lists; if not, then there's no need to change those slaves files.
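If you do want those slaves files updated, a quick sketch (the hadoop-w-N hostnames match the example worker names used earlier; adjust to your actual cluster size) is to append the new workers on the master node:
# On the master node (e.g. via gcloud compute ssh), append the new workers to
# both slaves files so the [stop|start]-all.sh scripts will see them.
# hadoop-w-2..4 are the example worker names from above; adjust as needed.
for w in hadoop-w-2 hadoop-w-3 hadoop-w-4; do
  echo "${w}" | sudo tee -a /home/hadoop/hadoop-install/conf/slaves \
                            /home/hadoop/spark-install/conf/slaves >/dev/null
done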