Tags: memory, hadoop, streaming, emr

How to change memory in EMR hadoop streaming job


I am trying to overcome the following error in a hadoop streaming job on EMR.

Container [pid=30356,containerID=container_1391517294402_0148_01_000021] is running beyond physical memory limits

I tried searching for answers but the one I found isn't working. My job is launched as shown below.

hadoop jar ../.versions/2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
 -input  determinations/part-00000 \
 -output  determinations/aggregated-0 \
 -mapper cat \
 -file ./det_maker.py \
 -reducer det_maker.py \
 -Dmapreduce.reduce.java.opts="-Xmx5120M"

The last line above is supposed to do the trick as far as I understand, but I get the error:

ERROR streaming.StreamJob: Unrecognized option: -Dmapreduce.reduce.java.opts="-Xmx5120M"

What is the correct way to change the memory usage? Also, is there documentation that explains these settings to n00bs like me?


Solution

  • You haven't said which memory you are running low on, physical or virtual.

    For both problems, take a look at Amazon's documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html

    Usually the solution is to increase the amount of memory per mapper, and possibly to reduce the number of mappers:

    s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapreduce.map.memory.mb=4000
    s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.tasktracker.map.tasks.maximum=2
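    As for the Unrecognized option error itself: Hadoop streaming requires generic options such as -D to appear before the streaming-specific options (-input, -mapper, and so on), and -D takes a space before the property name. A sketch of the reordered invocation, reusing the asker's paths (the mapreduce.reduce.memory.mb value of 6144 is an assumption chosen to leave headroom above the 5120M heap, since -Xmx only sizes the JVM heap while memory.mb sets the container's physical limit):

    hadoop jar ../.versions/2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
     -D mapreduce.reduce.memory.mb=6144 \
     -D mapreduce.reduce.java.opts="-Xmx5120M" \
     -input  determinations/part-00000 \
     -output  determinations/aggregated-0 \
     -mapper cat \
     -file ./det_maker.py \
     -reducer det_maker.py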