My MR job ended at map 100% reduce 35% with lots of error messages similar to running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container.
My input *.bz2
file is about 4GB, if I uncompress it, the size of it will be about 38GB, it took about one hour to run this job with one Master
and two slavers
on the Amazon EMR.
My questions are
- Why this job used so much memory?
- Why this job took about one hour? Usually running a 40GB wordcount job on a small 4-node cluster takes about 10 mins.
- How to tune the MR parameters to solve this problem?
- Which Amazon EC2 Instance types are the good fit to solve this problem?
Please refer to the following log:
- Physical memory (bytes) snapshot=43327889408 => 43.3GB
- Virtual memory (bytes) snapshot=108950675456 => 108.95GB
- Total committed heap usage (bytes)=34940649472 => 34.94GB
My proposed solutions are as follows, but I'm not sure if they are correct solutions or not
- use larger Amazon EC2 Instance which is at least 8GB in memory
- tune the MR parameters using the following codes
Version 1:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "jobtest1");
//don't kill the container, if the physical memory exceeds "mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb"
conf.setBoolean("yarn.nodemanager.pmem-check-enabled", false);
conf.setBoolean("yarn.nodemanager.vmem-check-enabled", false);
Version 2:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "jobtest2");
//conf.set("mapreduce.input.fileinputformat.split.minsize","3073741824");
conf.set("mapreduce.map.memory.mb", "8192");
conf.set("mapreduce.map.java.opts", "-Xmx6144m");
conf.set("mapreduce.reduce.memory.mb", "8192");
conf.set("mapreduce.reduce.java.opts", "-Xmx6144m");
Log:
15/11/08 11:37:27 INFO mapreduce.Job: map 100% reduce 35%
15/11/08 11:37:27 INFO mapreduce.Job: Task Id : attempt_1446749367313_0006_r_000006_2, Status : FAILED
Container [pid=24745,containerID=container_1446749367313_0006_01_003145] is running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container.
Dump of the process-tree for container_1446749367313_0006_01_003145 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 24745 24743 24745 24745 (bash) 0 0 9658368 291 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2304m -Djava.io.tmpdir=/mnt1/yarn/usercache/ec2-user/appcache/application_1446749367313_0006/container_1446749367313_0006_01_003145/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild **.***.***.*** 32846 attempt_1446749367313_0006_r_000006_2 3145 1>/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145/stdout 2>/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145/stderr
|- 24749 24745 24745 24745 (java) 14124 1281 3910426624 789477 /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2304m -Djava.io.tmpdir=/mnt1/yarn/usercache/ec2-user/appcache/application_1446749367313_0006/container_1446749367313_0006_01_003145/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild **.***.***.*** 32846 attempt_1446749367313_0006_r_000006_2 3145
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
15/11/08 11:37:28 INFO mapreduce.Job: map 100% reduce 25%
15/11/08 11:37:30 INFO mapreduce.Job: map 100% reduce 26%
15/11/08 11:37:37 INFO mapreduce.Job: map 100% reduce 27%
15/11/08 11:37:42 INFO mapreduce.Job: map 100% reduce 28%
15/11/08 11:37:53 INFO mapreduce.Job: map 100% reduce 29%
15/11/08 11:37:57 INFO mapreduce.Job: map 100% reduce 34%
15/11/08 11:38:02 INFO mapreduce.Job: map 100% reduce 35%
15/11/08 11:38:13 INFO mapreduce.Job: map 100% reduce 36%
15/11/08 11:38:22 INFO mapreduce.Job: map 100% reduce 37%
15/11/08 11:38:35 INFO mapreduce.Job: map 100% reduce 42%
15/11/08 11:38:36 INFO mapreduce.Job: map 100% reduce 100%
15/11/08 11:38:36 INFO mapreduce.Job: Job job_1446749367313_0006 failed with state FAILED due to: Task failed task_1446749367313_0006_r_000001
Job failed as tasks failed. failedMaps:0 failedReduces:1
15/11/08 11:38:36 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=11806418671
FILE: Number of bytes written=22240791936
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=16874
HDFS: Number of bytes written=0
HDFS: Number of read operations=59
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
S3: Number of bytes read=3942336319
S3: Number of bytes written=0
S3: Number of read operations=0
S3: Number of large read operations=0
S3: Number of write operations=0
Job Counters
Failed reduce tasks=22
Killed reduce tasks=5
Launched map tasks=59
Launched reduce tasks=27
Data-local map tasks=59
Total time spent by all maps in occupied slots (ms)=114327828
Total time spent by all reduces in occupied slots (ms)=131855700
Total time spent by all map tasks (ms)=19054638
Total time spent by all reduce tasks (ms)=10987975
Total vcore-seconds taken by all map tasks=19054638
Total vcore-seconds taken by all reduce tasks=10987975
Total megabyte-seconds taken by all map tasks=27438678720
Total megabyte-seconds taken by all reduce tasks=31645368000
Map-Reduce Framework
Map input records=728795619
Map output records=728795618
Map output bytes=50859151614
Map output materialized bytes=10506705085
Input split bytes=16874
Combine input records=0
Spilled Records=1457591236
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=150143
CPU time spent (ms)=14360870
Physical memory (bytes) snapshot=43327889408
Virtual memory (bytes) snapshot=108950675456
Total committed heap usage (bytes)=34940649472
File Input Format Counters
Bytes Read=0
I am not sure of Amazon EMR. So few points to consider regarding map reduce:
bzip2 is slower, although it compresses better than gzip. bzip2’s decompression speed is faster than its compression speed, but it is still slower than the other formats. So at a high level, you already have this compared to 40gb word count program which ran in ten minutes.(assuming that 40gb program don't have compression). Next question is, BUT HOW MUCH SLOWER
However, your job is still failing after one hour. Please confirm this. So only when the job runs successfully, can we thing of performance. For this reason, lets think of why is it failing. You were getting memory error. Also based on error, a container is failed during the reducer phase(as mapper phase is completed 100%). Mostly not even one reducer might have succeeded. Even though 32% might trick you to think that some reducers ran, that % could be due to preparing clean up work before first reducer runs. One way to confirm is, see if you have got any reducer output file generated.
Once confirming that, none of the reducer ran, you can increase the memory for containers as per your version 2.
Your version 1 will help you to see if only a specific container is causing issue and allowing the job to complete.