amazon-ec2 amazon-web-services elastic-map-reduce amazon-emr

How to configure an Amazon EMR streaming job to use EC2 spot instances (Ruby CLI)?


When I create a streaming job with Amazon Elastic MapReduce (Amazon EMR) using the Ruby command line interface, how can I make it use only EC2 spot instances (except for the master)? The command below works, but it "forces" me to use at least 1 core instance...

./elastic-mapreduce --create --stream          \
--name    n2_3                             \
--input   s3://mr/neuron/2              \
--output  s3://mr-out/neuron/2          \
--mapper  s3://mr/map.rb         \
--reducer s3://mr/noop_reduce.rb \
--instance-group master --instance-type m1.small --instance-count 1 \
--instance-group core   --instance-type m1.small --instance-count 1 \
--instance-group task   --instance-type m1.small --instance-count 18 --bid-price 0.028

Thanks


Solution

  • Both CORE and TASK nodes run TaskTrackers, but only CORE nodes run DataNodes, so yes, you need at least one CORE node.

    So you could run the CORE nodes as spot instances instead:

    ./elastic-mapreduce --create --stream \
    ...
    --instance-group master --instance-type m1.small --instance-count 1 \
    --instance-group core   --instance-type m1.small --instance-count 19 --bid-price 0.028
    

    P.S. You could also run one CORE node and many TASK nodes, but depending on how much reading/writing you're doing, you'll have pain, since 18 nodes will be reading from and writing to 1 node (the only DataNode).

    # expect problems....
    ./elastic-mapreduce --create --stream \
    ...
    --instance-group master --instance-type m1.small --instance-count 1 \
    --instance-group core   --instance-type m1.small --instance-count 1  --bid-price 0.028 \
    --instance-group task   --instance-type m1.small --instance-count 18 --bid-price 0.028
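
If you want to try different node counts or bids without retyping the flags, something like the following might help. It is only a sketch: `emr_instance_args` is a hypothetical helper (not part of the EMR CLI), it uses only the flags already shown above, and it just prints the instance-group flags rather than launching anything. The values 19 and 0.028 are taken from the first example and are assumptions about what you would want.

```shell
#!/bin/sh
# Sketch only: emr_instance_args is a hypothetical helper, not part of the EMR CLI.
# It prints the instance-group flags and never launches a job flow.

INSTANCE_TYPE=m1.small   # same instance type as in the commands above

# usage: emr_instance_args CORE_COUNT BID_PRICE
emr_instance_args() {
    core_count=$1
    bid_price=$2
    # The master group has no --bid-price, so it stays on-demand;
    # the core group carries the bid, so it is requested as spot.
    printf '%s %s\n' \
        "--instance-group master --instance-type $INSTANCE_TYPE --instance-count 1" \
        "--instance-group core --instance-type $INSTANCE_TYPE --instance-count $core_count --bid-price $bid_price"
}

# Print the flags for 19 spot core nodes with a 0.028 bid.
emr_instance_args 19 0.028
```

The printed flags would go at the end of your own `./elastic-mapreduce --create --stream` command, after the name, input, output, mapper and reducer options.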