Tags: hadoop, emr, mrjob

Attach the same EBS snapshot to every EMR volume?


I want to work with an EBS snapshot in an EMR job. Because the mapper reads from the snapshot, I want the snapshot mounted on every node. Is there an easy way to do that other than logging in to each node? I guess I could make mounting it the first step of my MapReduce job, but that seems wrong. Is there an easier way to do it?


Solution

  • It is possible, but you'll have to jump through some hoops to get it to work, assuming you have a recipe for creating an EBS volume from the EBS snapshot in a shell script. EMR provides bootstrap actions, which are just shell scripts you create and run; they execute on every node before any jobs (steps, in EMR terms) are allowed to run.

    Here are the steps your shell script needs to perform (a sketch that combines them follows the metadata-service example below):

    1. Create a new EBS volume based on your snapshot. The aws binary is installed on all EMR instances, so that's your best bet. Assuming you know the snapshot id, this should be straightforward: http://docs.aws.amazon.com/cli/latest/reference/ec2/create-volume.html
      • Make sure the attachment's DeleteOnTermination flag is set, so the volume is removed when the instance terminates.
      • You will need to parse the response to get the EBS volume id.
    2. Attach the volume you just created (using the EBS volume id) to the current instance: http://docs.aws.amazon.com/cli/latest/reference/ec2/attach-volume.html

    To get the current instance id, use the metadata service:

    wget -q -O - http://instance-data/latest/meta-data/instance-id
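
    Here is a minimal sketch of such a bootstrap script, assuming the snapshot id, device name, and mount point are placeholders you replace with your own values, and that the cluster's EC2 instance profile is allowed to call ec2:CreateVolume, ec2:AttachVolume, ec2:ModifyInstanceAttribute, and ec2:DescribeVolumes (you may also need to pass --region explicitly if it isn't configured on the node):

    #!/bin/bash
    # Bootstrap action sketch: clone an EBS snapshot and mount it on this node.
    set -euo pipefail

    SNAPSHOT_ID="snap-0123456789abcdef0"   # placeholder: your snapshot id
    DEVICE="/dev/sdf"                      # placeholder: on some instance types this shows up as /dev/xvdf
    MOUNT_POINT="/mnt/snapshot-data"       # placeholder: where the mappers will read from

    # The metadata service tells us which instance and availability zone we are in.
    INSTANCE_ID=$(wget -q -O - http://instance-data/latest/meta-data/instance-id)
    AZ=$(wget -q -O - http://instance-data/latest/meta-data/placement/availability-zone)

    # Step 1: create a new volume from the snapshot and capture its id.
    VOLUME_ID=$(aws ec2 create-volume \
        --snapshot-id "$SNAPSHOT_ID" \
        --availability-zone "$AZ" \
        --query 'VolumeId' --output text)
    aws ec2 wait volume-available --volume-ids "$VOLUME_ID"

    # Step 2: attach the volume to this instance.
    aws ec2 attach-volume \
        --volume-id "$VOLUME_ID" \
        --instance-id "$INSTANCE_ID" \
        --device "$DEVICE"
    aws ec2 wait volume-in-use --volume-ids "$VOLUME_ID"

    # Flag the attachment for deletion when the instance terminates, so orphaned
    # volumes don't keep accruing charges after the cluster is gone.
    aws ec2 modify-instance-attribute \
        --instance-id "$INSTANCE_ID" \
        --block-device-mappings "[{\"DeviceName\":\"$DEVICE\",\"Ebs\":{\"DeleteOnTermination\":true}}]"

    # Mount the volume read-only so every mapper can read the snapshot data.
    sudo mkdir -p "$MOUNT_POINT"
    sudo mount -o ro "$DEVICE" "$MOUNT_POINT"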
    

    Once you have your shell script, you need to upload it to S3, and then add that script as a bootstrap action to your cluster: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
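
    For example, a hypothetical create-cluster call that wires in the bootstrap action might look like this (the bucket name, script path, release label, and instance settings are all placeholders):

    aws emr create-cluster \
        --name "snapshot-cluster" \
        --release-label emr-5.36.0 \
        --applications Name=Hadoop \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --bootstrap-actions Path=s3://my-bucket/bootstrap/attach-snapshot.sh,Name=AttachSnapshot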

    Also beware: you will be charged for each EBS volume you create, so make sure the delete-on-termination logic is set up properly!
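
    If you want to double-check that the cleanup flag actually took effect, you can query the attachment once the cluster is up (the volume id below is a placeholder):

    # Prints "true" if the volume will be deleted when the instance terminates.
    aws ec2 describe-volumes \
        --volume-ids vol-0123456789abcdef0 \
        --query 'Volumes[0].Attachments[0].DeleteOnTermination' \
        --output text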