Tags: python, packaging, google-cloud-dataproc

Dataproc + Python package: Distribute updated versions


Currently I am developing a Spark application on Google Cloud Dataproc. I frequently need to update my Python package on the cluster. During provisioning I run the following commands:

echo "Downloading and extracting source code..."
gsutil cp gs://mybucket/mypackage.tar.gz ./
tar -xvzf mypackage.tar.gz
cd ./mypackage

echo "Installing requirements..."
sudo apt-get install -y python-pip
python setup.py install

However, what is the most effective way to distribute updated packages within the cluster? Is there any built-in automation for this (the way Chef handles it, for example)?

Currently I do one of two things: either deploy and bootstrap a new cluster (which takes time), or SSH into each node and copy and install the updated package.
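
For reference, the SSH variant looks roughly like the sketch below. The cluster name, zone, and worker count are placeholders; Dataproc names its nodes <cluster>-m (master) and <cluster>-w-N (workers).

CLUSTER=my-cluster
ZONE=us-central1-f

# Repeat the provisioning steps on the master and each worker.
for node in ${CLUSTER}-m ${CLUSTER}-w-0 ${CLUSTER}-w-1; do
    gcloud compute ssh ${node} --zone ${ZONE} --command \
        "gsutil cp gs://mybucket/mypackage.tar.gz ./ && \
         tar -xzf mypackage.tar.gz && \
         cd ./mypackage && sudo python setup.py install"
done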


Solution

  • In general, deploying a new cluster with initialization actions is the preferred approach: it keeps your development workflow reproducible if you need to clone new clusters, change more fundamental machine or zone settings, or recover from accidentally breaking an existing cluster in a messy way. It also ensures fresh patches for all installed software, and it works much better with dynamically scaling your cluster up and down than SSH-based configuration does.
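
    For example, a cluster rebuild with your provisioning script attached as an initialization action might look roughly like this; the script name, bucket, and cluster settings are placeholders, and the script body would be the commands from your question:

    # Upload the provisioning script to GCS, then create a cluster
    # that runs it on every node at startup.
    gsutil cp ./install-mypackage.sh gs://mybucket/install-mypackage.sh
    gcloud dataproc clusters create my-cluster \
        --zone us-central1-f \
        --num-workers 2 \
        --initialization-actions gs://mybucket/install-mypackage.sh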

    That said, for modifying an existing cluster you can also try bdutil, which happens to be compatible with Dataproc's instance naming as long as you're not using any preemptible workers (though this compatibility isn't officially guaranteed). It provides a handy way to run commands on all your nodes over SSH, with some helpful error-message gathering when a command fails:

    CLUSTER=<dataproc-cluster-name>
    PROJECT=<Google project you used to create the Dataproc cluster>
    BUCKET=<mybucket>
    ZONE=<dataproc cluster zone, like us-central1-f>
    NUM_WORKERS=<number of workers in dataproc cluster>
    
    # Run "sudo apt-get install -y python-pip" on all nodes
    ./bdutil -P ${CLUSTER} -p ${PROJECT} -b ${BUCKET} -z ${ZONE} -n ${NUM_WORKERS} \
        run_command -t all -- "sudo apt-get install -y python-pip"
    

    You can also use -t master to run something only on the master node, or -t workers to run something only on the worker nodes.
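
    Putting it together, redistributing an updated package could then be a single run_command invocation that repeats the install steps from your question (same placeholder bucket and package paths as above):

    # Re-download, extract, and install the updated package on every node.
    # sudo is used here on the assumption that the SSH user is not root;
    # adjust as needed for your setup.
    ./bdutil -P ${CLUSTER} -p ${PROJECT} -b ${BUCKET} -z ${ZONE} -n ${NUM_WORKERS} \
        run_command -t all -- \
        "gsutil cp gs://mybucket/mypackage.tar.gz ./ && \
         tar -xzf mypackage.tar.gz && \
         cd ./mypackage && sudo python setup.py install"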