I am trying to overcome the following error in a hadoop streaming job on EMR.
Container [pid=30356,containerID=container_1391517294402_0148_01_000021] is running beyond physical memory limits
I searched for answers, but the one I found doesn't work. My job is launched as shown below.
hadoop jar ../.versions/2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
-input determinations/part-00000 \
-output determinations/aggregated-0 \
-mapper cat \
-file ./det_maker.py \
-reducer det_maker.py \
-Dmapreduce.reduce.java.opts="-Xmx5120M"
The last line above is supposed to do the trick as far as I understand, but I get the error:
ERROR streaming.StreamJob: Unrecognized option: -Dmapreduce.reduce.java.opts="-Xmx5120M"
What is the correct way to change the memory usage? Also, is there some documentation that explains these things to n00bs like me?
You haven't elaborated on which memory you are running low on: physical or virtual.
For both problems, take a look at Amazon's documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html
Usually the solution is to increase the amount of memory per mapper, and possibly to reduce the number of mappers:
s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapreduce.map.memory.mb=4000
s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.tasktracker.map.tasks.maximum=2
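As for the `Unrecognized option` error itself: Hadoop streaming expects generic options such as `-D` to come *before* the streaming-specific options (`-input`, `-mapper`, etc.), which is why the `-D` flag at the end of your command isn't recognized. A reordered invocation along these lines should at least parse (the `6144` container size is an assumed value, chosen only so that the YARN container limit sits above the 5120M heap):

```shell
# Generic -D options must precede the streaming options, or the
# streaming jar reports "Unrecognized option".
hadoop jar ../.versions/2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -Dmapreduce.reduce.memory.mb=6144 \
    -Dmapreduce.reduce.java.opts="-Xmx5120M" \
    -input determinations/part-00000 \
    -output determinations/aggregated-0 \
    -mapper cat \
    -file ./det_maker.py \
    -reducer det_maker.py
```

Note that raising only the Java heap (`-Xmx`) will not cure a "beyond physical memory limits" kill on its own: the YARN container size (`mapreduce.reduce.memory.mb`) must be larger than the heap, since the container also has to hold non-heap memory.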