
How to set the number of parallel reducers on EMR?


I am running a job on EMR with mrjob; I am using AMI version 2.4.7 and Hadoop version 1.0.3.

I want to specify the number of reducers for a job, because I want to provide higher parallelism to the next one. Reading the answers to other questions on this site, I gathered that I should set these two parameters, and so I did: mapred.reduce.tasks=576 and mapred.tasktracker.reduce.tasks.maximum=24
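
For reference, this is roughly how the settings were passed through mrjob; a sketch using mrjob's --jobconf flag (my_job.py and the input path are placeholders, not from the original question):

```shell
# Pass Hadoop jobconf values to an mrjob run on EMR (my_job.py is a placeholder).
# Note: --jobconf affects job-level settings; the tasktracker slot count is a
# daemon-level setting, which is the crux of the question below.
python my_job.py -r emr \
    --jobconf mapred.reduce.tasks=576 \
    --jobconf mapred.tasktracker.reduce.tasks.maximum=24 \
    input.txt
```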

However, it seems like the second option is not picked up: both the EMR and the Hadoop interfaces report that there are 576 reduce tasks to run, but the capacity of the cluster remains at 72 (r3.8xlarge instances).

I even see that the option is set in var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_XXX/job.xml: <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>24</value></property>. Still, only the default number (9) of reducers run at the same time.

Why is the option not picked up by EMR? Or is there a different way to force a higher number of reducers on an instance?


Solution

  • With Hadoop 1, the map and reduce slots per node are set at the daemon level and thus require a restart of the TaskTracker daemons if the value is changed.

    On EMR, the default number of slots per instance type can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_H1.0.3.html.

    In order to change these default values you will need to use a bootstrap action, such as configure-hadoop, to modify the mapred.tasktracker.reduce.tasks.maximum setting on the cluster before the Hadoop daemons start. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop for more details.

    Example (will need to be modified to match whatever interface is being used to create the cluster):

    s3://<region>.elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.tasktracker.reduce.tasks.maximum=24
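
    For a concrete invocation, a sketch of wiring that bootstrap action into cluster creation with the AWS CLI (the region, instance count, and other cluster parameters here are placeholders; older AMI-version clusters were often launched with the legacy elastic-mapreduce CLI instead):

    ```shell
    # Sketch: launch an AMI 2.4.7 (Hadoop 1) cluster with the configure-hadoop
    # bootstrap action raising the per-node reduce slot count to 24.
    aws emr create-cluster \
        --ami-version 2.4.7 \
        --instance-type r3.8xlarge \
        --instance-count 8 \
        --bootstrap-actions \
          Path=s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["-m","mapred.tasktracker.reduce.tasks.maximum=24"]
    ```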
    

    Please note that if you change the number of slots per node, be sure to also adjust mapred.child.java.opts so that the per-task heap size is reasonable for the memory actually available on the instance.
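
    As a rough sizing sketch (the memory split below is an assumption, not an EMR default): an r3.8xlarge has 244 GiB of RAM; reserve a portion for the OS and the Hadoop daemons, then divide the remainder across all concurrent task slots to get a candidate heap size.

    ```shell
    # Rough heap-sizing check; the reservation and map slot count are assumptions.
    total_mb=$((244 * 1024))       # physical RAM on an r3.8xlarge, in MiB
    reserved_mb=$((24 * 1024))     # assumed reservation for OS + Hadoop daemons
    map_slots=24                   # assumed concurrent map slots per node
    reduce_slots=24                # the value set via the bootstrap action above
    heap_per_task_mb=$(( (total_mb - reserved_mb) / (map_slots + reduce_slots) ))
    echo "$heap_per_task_mb"       # prints 4693
    ```

    You would then set something like mapred.child.java.opts=-Xmx4693m (or round down to a cleaner value) via the same configure-hadoop bootstrap action.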