I want to install a Python module on every node in an Amazon EMR cluster. The obvious way to do this seems to be to SSH into every node and install it at the command line. I was looking at YARN as a way of running the same JAR file on every node in a cluster, but YARN's "jar" command seems to run only on the local system.
You can use a bootstrap action to install third-party software on every EMR node while the cluster is being brought up.
If you are using the command line, you can pass a shell script stored in S3 as a bootstrap action:
aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
--use-default-roles --ec2-attributes KeyName=myKey \
--applications Name=Hue Name=Hive Name=Pig \
--instance-count 5 --instance-type m3.xlarge \
--bootstrap-actions Path="s3://elasticmapreduce/bootstrap-actions/download.sh"
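For the Python module case in the question, the script behind the bootstrap action can be very small. The following is only a sketch under stated assumptions: the file name install-python-module.sh, the bucket s3://my-bucket, and the module name "requests" are all placeholders, and it assumes pip is available on the AMI and that the bootstrap user can use sudo.

```shell
# Write a minimal bootstrap script locally (the file name is hypothetical).
cat > install-python-module.sh <<'EOF'
#!/bin/bash
# Runs on each node while it is being provisioned; stop on the first failure.
set -e
# "requests" is a placeholder; substitute the module you actually need.
# Assumes pip is present on the AMI.
sudo pip install requests
EOF

# Check the script for syntax errors before uploading it.
bash -n install-python-module.sh

# Then upload it and point the bootstrap action at it, for example
# (my-bucket is a placeholder bucket name):
#   aws s3 cp install-python-module.sh s3://my-bucket/bootstrap/install-python-module.sh
#   aws emr create-cluster ... --bootstrap-actions Path="s3://my-bucket/bootstrap/install-python-module.sh"
```

Because the bootstrap action runs on every node as it starts, nodes added later by resizing the cluster should get the module as well, with no need to SSH into each one.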
If you are using the web interface, go to the advanced options; under General Cluster Settings you can specify Bootstrap Actions.