I am running a job on EMR with mrjob; I am using AMI version 2.4.7 and Hadoop version 1.0.3.
I want to specify the number of reducers for a job, because I want to provide a higher parallellism to the next one. Reading the answers to the other questions on this site, I gathered that I should set these parameters, and so I did:
mapred.reduce.tasks=576
mapred.tasktracker.reduce.tasks.maximum=24
However, it seems like the second option is not picked up: both the EMR and the Hadoop interfaces report that there are 576 reduce tasks to run, but the capacity of the cluster remains at 72 (r3.8xlarge instances).
I even see that the option is set in var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_XXX/job.xml:<property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>24</value></property>
. Still, only the default number (9) of actual reducers are running at the same time.
Why is the option not picked up by EMR? Or is there a different way to force a higher number of reducers on an instance?
With Hadoop 1, the map and reduce slots per node are set at the daemon level and thus require a restart of the TaskTracker daemons if the value is changed.
On EMR, the default number of slots per instance type can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_H1.0.3.html.
In order to change these default values you will need to use a bootstrap action like configure-hadoop
to modify the mapred.tasktracker.reduce.tasks.maximum
on the cluster before Hadoop daemons start. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop for more details.
Example (will need to be modified to match whatever interface is being used to create the cluster):
s3://<region>.elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.tasktracker.reduce.tasks.maximum=24
Please note if changing the number of slots per node be sure to adjust mapred.child.java.opts
to provide an upper memory amount that is reasonable for the amount of memory available.