Search code examples
hadoophadoop-yarnemr

Run a Unix shell command on every EMR / Yarn node


I want to install a Python module on every node in an Amazon EMR cluster. It looks like the obvious way to do this is to ssh in to every node and install it at the command line. I was looking at YARN as a way of running the same JAR file on every node in a cluster, but YARN's "jar" command seems to run on the local system.


Solution

  • You can use bootstrap to install 3rd party software on every EMR node while bringing up the cluster.

    If you are using command line, you can pass shell script that is saved in s3 as part of bootstrap action.

    aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
    --use-default-roles --ec2-attributes KeyName=myKey \
    --applications Name=Hue Name=Hive Name=Pig \
    --instance-count 5 --instance-type m3.xlarge \
    --bootstrap-action Path="s3://elasticmapreduce/bootstrap-actions/download.sh"
    

    If you are using web interface

    • Create shell script to download necessary software
    • Go to advanced options and as part of General Cluster Settings you can specify Bootstrap Actions
    • Every time you clone the cluster, those actions will be preserved and make sure bootstrap is done while bringing up the cluster.