Tags: hadoop, mapreduce, elastic-map-reduce, emr

How to decide on the number of parallel mappers/reducers along with heap memory?


Say I have an EMR job running on an 11-node cluster: one m1.small master node and 10 m1.xlarge slave nodes.

Each m1.xlarge node has 15 GB of RAM.

How should I then decide the number of parallel mappers and reducers to set?

My jobs are memory intensive, and I would like as much heap as possible allotted to the JVM.

Another related question: if we set the following parameters:

 <property><name>mapred.child.java.opts</name><value>-Xmx4096m</value></property>
 <property><name>mapred.job.reuse.jvm.num.tasks</name><value>1</value></property>
 <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
 <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>2</value></property>

Will this 4 GB be shared by all four processes (2 mappers and 2 reducers), or will each of them get 4 GB?


Solution

  • They will each get 4 GB.

    You should check the heap settings for your TaskTracker and DataNode daemons; then you'll know how much memory is left over to allocate to children (the actual mappers/reducers).

    Then it's just a balancing act: if you need more memory per task, you'll want fewer mappers/reducers, and vice versa.
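    That balancing act can be put into numbers. The sketch below is a back-of-the-envelope budget for one m1.xlarge slave node; the OS overhead and the TaskTracker/DataNode heap sizes are assumptions for illustration, not EMR defaults, so substitute whatever your cluster actually uses:

    ```python
    # Rough per-node memory budget for classic (MRv1) Hadoop on an m1.xlarge.
    # Daemon heaps and OS overhead below are assumed values, not EMR defaults.

    node_ram_mb = 15 * 1024          # m1.xlarge RAM
    os_and_services_mb = 1024        # OS, monitoring agents, etc. (assumption)
    tasktracker_heap_mb = 1024       # TaskTracker daemon heap (assumption)
    datanode_heap_mb = 1024          # DataNode daemon heap (assumption)

    child_heap_mb = 4096             # from mapred.child.java.opts = -Xmx4096m

    available_mb = (node_ram_mb - os_and_services_mb
                    - tasktracker_heap_mb - datanode_heap_mb)
    max_child_jvms = available_mb // child_heap_mb

    print(available_mb)    # memory left for child JVMs
    print(max_child_jvms)  # how many 4 GB children safely fit
    ```

    Under these assumptions only about 12 GB is left for children, so at most 3 concurrent 4 GB child JVMs fit per node; the 2 map + 2 reduce slots in the question would overcommit a 15 GB machine if all four run at once.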

    Also keep in mind how many cores your CPU has; you don't want 100 map tasks competing for a single core. To tune this, it's best to monitor both heap usage and CPU utilization over time so you can adjust the knobs accordingly.