Tags: python, amazon-web-services, boto, amazon-emr

How to launch and configure an EMR cluster using boto


I'm trying to launch a cluster and run a job, all using boto. I find lots of examples of creating job flows, but I can't for the life of me find an example that shows:

  1. How to define the cluster to be used (by cluster_id)
  2. How to configure and launch a cluster (for example, if I want to use spot instances for some task nodes)

Am I missing something?


Solution

  • Boto and the underlying EMR API currently mix the terms "cluster" and "job flow", and "job flow" is being deprecated. I consider them synonyms.

    You create a new cluster by calling the run_jobflow() method of a boto.emr.connection.EmrConnection object. It returns the cluster ID that EMR generates for you.

    First, the mandatory imports and the connection:

    #!/usr/bin/env python
    
    import boto
    import boto.emr
    from boto.emr.instance_group import InstanceGroup
    
    conn = boto.emr.connect_to_region('us-east-1')
    

    Then we specify the instance groups, including the maximum spot price (the bid, in USD per instance-hour) we are willing to pay for the TASK nodes:

    instance_groups = []
    instance_groups.append(InstanceGroup(
        num_instances=1,
        role="MASTER",
        type="m1.small",
        market="ON_DEMAND",
        name="Main node"))
    instance_groups.append(InstanceGroup(
        num_instances=2,
        role="CORE",
        type="m1.small",
        market="ON_DEMAND",
        name="Worker nodes"))
    instance_groups.append(InstanceGroup(
        num_instances=2,
        role="TASK",
        type="m1.small",
        market="SPOT",
        name="My cheap spot nodes",
        bidprice="0.002"))
    

    Finally we start a new cluster:

    cluster_id = conn.run_jobflow(
        "Name for my cluster",
        instance_groups=instance_groups,
        action_on_failure='TERMINATE_JOB_FLOW',
        keep_alive=True,
        enable_debugging=True,
        log_uri="s3://mybucket/logs/",
        hadoop_version=None,
        ami_version="2.4.9",
        steps=[],
        bootstrap_actions=[],
        ec2_keyname="my-ec2-key",
        visible_to_all_users=True,
        job_flow_role="EMR_EC2_DefaultRole",
        service_role="EMR_DefaultRole")
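
    Note that run_jobflow() returns as soon as EMR accepts the request, while the instances are still being provisioned. If later code needs an idle cluster, one option is to poll its state. The following is only a sketch: wait_until_ready, poll_seconds, max_polls and sleep are names invented for this illustration, and it assumes a boto 2 release recent enough to provide EmrConnection.describe_cluster().

```python
import time

# Cluster states after which the cluster will never become usable.
DEAD_STATES = ("TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS")


def wait_until_ready(conn, cluster_id, poll_seconds=30, max_polls=60,
                     sleep=time.sleep):
    """Poll until the cluster is WAITING, i.e. idle and ready for steps.

    `conn` is assumed to be a boto.emr.connection.EmrConnection.
    Returns the final state, or raises RuntimeError if the cluster
    shuts down or is still not ready after `max_polls` checks.
    """
    state = None
    for _ in range(max_polls):
        state = conn.describe_cluster(cluster_id).status.state
        if state == "WAITING":
            return state
        if state in DEAD_STATES:
            raise RuntimeError(
                "Cluster %s ended in state %s" % (cluster_id, state))
        # Still STARTING, BOOTSTRAPPING or RUNNING: check again later.
        sleep(poll_seconds)
    raise RuntimeError(
        "Cluster %s still %s after %d polls" % (cluster_id, state, max_polls))
```

    With the connection and cluster ID from the surrounding code, this would be called as wait_until_ready(conn, cluster_id). The sleep parameter exists only so that the delay can be replaced in tests.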
    

    We can also print the cluster ID if we care about that:

    print("Starting cluster %s" % cluster_id)
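
    As for the first part of the question: once you have a cluster ID, whether just returned by run_jobflow() or copied from the EMR console, you can address that cluster by passing the ID to add_jobflow_steps(). A hedged sketch follows. The helper names, the step name and the s3://mybucket/... paths are made up for illustration, and the "j-" prefix check is only a sanity guard based on the usual format of EMR cluster IDs.

```python
def build_streaming_step(input_uri, output_uri):
    """Build a Hadoop streaming step (all script paths are hypothetical)."""
    # Imported here so the module can be loaded without boto installed.
    from boto.emr.step import StreamingStep
    return StreamingStep(
        name="Example streaming step",
        mapper="s3://mybucket/code/mapper.py",    # hypothetical script
        reducer="s3://mybucket/code/reducer.py",  # hypothetical script
        input=input_uri,
        output=output_uri,
        # Keep the long-running cluster alive if this step fails.
        action_on_failure="CANCEL_AND_WAIT")


def add_step_to_cluster(conn, cluster_id, step):
    """Queue one step on an existing cluster, addressed by its ID.

    `conn` is assumed to be a boto.emr.connection.EmrConnection.
    Returns whatever add_jobflow_steps() returns.
    """
    if not cluster_id.startswith("j-"):
        raise ValueError("Not an EMR cluster ID: %r" % (cluster_id,))
    return conn.add_jobflow_steps(cluster_id, [step])
```

    Usage would look like add_step_to_cluster(conn, cluster_id, build_streaming_step("s3://mybucket/input/", "s3://mybucket/output/")). Because the cluster above was started with keep_alive=True, it keeps running after the step finishes, so remember to shut it down with conn.terminate_jobflow(cluster_id) when you are done.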