Tags: hadoop, memory, hadoop2, emr, amazon-emr

How to tune the Hadoop MapReduce parameters on Amazon EMR?


My MR job ended at map 100% reduce 35% with many error messages similar to: "running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container."

My input *.bz2 file is about 4 GB; uncompressed it is about 38 GB. The job took about one hour to run with one master and two slave nodes on Amazon EMR.

My questions are:
- Why did this job use so much memory?
- Why did this job take about one hour? Running a 40 GB wordcount job on a small 4-node cluster usually takes about 10 minutes.
- How should I tune the MR parameters to solve this problem?
- Which Amazon EC2 instance types are a good fit for this problem?

Please refer to the following log:
- Physical memory (bytes) snapshot=43327889408 => 43.3GB
- Virtual memory (bytes) snapshot=108950675456 => 108.95GB
- Total committed heap usage (bytes)=34940649472 => 34.94GB

My proposed solutions are as follows, but I'm not sure whether they are correct:
- use a larger Amazon EC2 instance type with at least 8 GB of memory
- tune the MR parameters with the following code

Version 1:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// don't kill the container if the physical/virtual memory exceeds
// "mapreduce.map.memory.mb" / "mapreduce.reduce.memory.mb"
conf.setBoolean("yarn.nodemanager.pmem-check-enabled", false);
conf.setBoolean("yarn.nodemanager.vmem-check-enabled", false);
// set the properties before creating the Job, because Job.getInstance() copies the Configuration
Job job = Job.getInstance(conf, "jobtest1");

Version 2:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
//conf.set("mapreduce.input.fileinputformat.split.minsize", "3073741824");
// request 8 GB containers and give the JVM heap ~75% of the container size
conf.set("mapreduce.map.memory.mb", "8192");
conf.set("mapreduce.map.java.opts", "-Xmx6144m");
conf.set("mapreduce.reduce.memory.mb", "8192");
conf.set("mapreduce.reduce.java.opts", "-Xmx6144m");
// set the properties before creating the Job, because Job.getInstance() copies the Configuration
Job job = Job.getInstance(conf, "jobtest2");

Log:

15/11/08 11:37:27 INFO mapreduce.Job:  map 100% reduce 35%
15/11/08 11:37:27 INFO mapreduce.Job: Task Id : attempt_1446749367313_0006_r_000006_2, Status : FAILED
Container [pid=24745,containerID=container_1446749367313_0006_01_003145] is running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container.
Dump of the process-tree for container_1446749367313_0006_01_003145 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 24745 24743 24745 24745 (bash) 0 0 9658368 291 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Xmx2304m -Djava.io.tmpdir=/mnt1/yarn/usercache/ec2-user/appcache/application_1446749367313_0006/container_1446749367313_0006_01_003145/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild **.***.***.*** 32846 attempt_1446749367313_0006_r_000006_2 3145 1>/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145/stdout 2>/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145/stderr  
    |- 24749 24745 24745 24745 (java) 14124 1281 3910426624 789477 /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2304m -Djava.io.tmpdir=/mnt1/yarn/usercache/ec2-user/appcache/application_1446749367313_0006/container_1446749367313_0006_01_003145/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild **.***.***.*** 32846 attempt_1446749367313_0006_r_000006_2 3145 

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

15/11/08 11:37:28 INFO mapreduce.Job:  map 100% reduce 25%
15/11/08 11:37:30 INFO mapreduce.Job:  map 100% reduce 26%
15/11/08 11:37:37 INFO mapreduce.Job:  map 100% reduce 27%
15/11/08 11:37:42 INFO mapreduce.Job:  map 100% reduce 28%
15/11/08 11:37:53 INFO mapreduce.Job:  map 100% reduce 29%
15/11/08 11:37:57 INFO mapreduce.Job:  map 100% reduce 34%
15/11/08 11:38:02 INFO mapreduce.Job:  map 100% reduce 35%
15/11/08 11:38:13 INFO mapreduce.Job:  map 100% reduce 36%
15/11/08 11:38:22 INFO mapreduce.Job:  map 100% reduce 37%
15/11/08 11:38:35 INFO mapreduce.Job:  map 100% reduce 42%
15/11/08 11:38:36 INFO mapreduce.Job:  map 100% reduce 100%
15/11/08 11:38:36 INFO mapreduce.Job: Job job_1446749367313_0006 failed with state FAILED due to: Task failed task_1446749367313_0006_r_000001
Job failed as tasks failed. failedMaps:0 failedReduces:1

15/11/08 11:38:36 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=11806418671
        FILE: Number of bytes written=22240791936
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=16874
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=59
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
        S3: Number of bytes read=3942336319
        S3: Number of bytes written=0
        S3: Number of read operations=0
        S3: Number of large read operations=0
        S3: Number of write operations=0
    Job Counters 
        Failed reduce tasks=22
        Killed reduce tasks=5
        Launched map tasks=59
        Launched reduce tasks=27
        Data-local map tasks=59
        Total time spent by all maps in occupied slots (ms)=114327828
        Total time spent by all reduces in occupied slots (ms)=131855700
        Total time spent by all map tasks (ms)=19054638
        Total time spent by all reduce tasks (ms)=10987975
        Total vcore-seconds taken by all map tasks=19054638
        Total vcore-seconds taken by all reduce tasks=10987975
        Total megabyte-seconds taken by all map tasks=27438678720
        Total megabyte-seconds taken by all reduce tasks=31645368000
    Map-Reduce Framework
        Map input records=728795619
        Map output records=728795618
        Map output bytes=50859151614
        Map output materialized bytes=10506705085
        Input split bytes=16874
        Combine input records=0
        Spilled Records=1457591236
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=150143
        CPU time spent (ms)=14360870
        Physical memory (bytes) snapshot=43327889408
        Virtual memory (bytes) snapshot=108950675456
        Total committed heap usage (bytes)=34940649472
    File Input Format Counters 
        Bytes Read=0

Solution

  • I am not sure about Amazon EMR, so here are a few points to consider regarding MapReduce:

    1. bzip2 is slower, although it compresses better than gzip. bzip2's decompression is faster than its compression, but it is still slower than the other common formats. So at a high level you already have a decompression overhead compared to the 40 GB wordcount job that ran in ten minutes (assuming that job did not use compression). The next question is: how much slower?

    2. However, your job is still failing after one hour; please confirm this. Only when the job runs successfully can we think about performance, so let's first look at why it is failing. You were getting a memory error, and based on that error a container failed during the reduce phase (the map phase completed 100%). Most likely not even one reducer succeeded. Even though the ~35% reduce progress might trick you into thinking that some reducers ran, that percentage can come from the shuffle and sort work that happens before the first reduce() call runs. One way to confirm this is to check whether any reducer output file was generated; see the sketch after these points.

    Once you confirm that none of the reducers ran, you can increase the container memory as in your Version 2.

    Your Version 1 will help you see whether only a specific container is causing the issue, by letting the job run to completion instead of killing the container.
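
    To make the check in point 2 concrete, here is a minimal sketch of what I mean by looking for reducer output; the output directory argument is a placeholder for your job's actual output location. Reducer output files are named part-r-NNNNN, so if no such files exist, no reducer completed successfully.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckReducerOutput {
    public static void main(String[] args) throws Exception {
        // args[0] = the job's output directory, e.g. an s3:// or hdfs:// path (placeholder)
        Path outputDir = new Path(args[0]);
        FileSystem fs = FileSystem.get(outputDir.toUri(), new Configuration());

        if (!fs.exists(outputDir)) {
            System.out.println("Output directory does not exist - the job committed no output at all.");
            return;
        }
        // reducers write their results as part-r-00000, part-r-00001, ...
        FileStatus[] parts = fs.globStatus(new Path(outputDir, "part-r-*"));
        if (parts == null || parts.length == 0) {
            System.out.println("No part-r-* files found - no reducer finished successfully.");
        } else {
            for (FileStatus part : parts) {
                System.out.println(part.getPath() + " : " + part.getLen() + " bytes");
            }
        }
    }
}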