We are trying to migrate our jobs to Hadoop 2 (Hadoop 2.8.1, single node cluster, to be precise) from Hadoop 1.0.3. We are using YARN to manage our map-reduce jobs. One of the differences that we have noticed is the presence of two Linux processes for each map or reduce task that is planned for execution. For example, for any of our reduce tasks, we find these two executing processes:
hadoop 124692 124690 0 12:33 ? 00:00:00 /bin/bash -c /opt/java/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx5800M -XX:-UsePerfData -Djava.io.tmpdir=/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1510651062679_0001/container_1510651062679_0001_01_000278/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Dyarn.app.mapreduce.shuffle.logger=INFO,shuffleCLA -Dyarn.app.mapreduce.shuffle.logfile=syslog.shuffle -Dyarn.app.mapreduce.shuffle.log.filesize=0 -Dyarn.app.mapreduce.shuffle.log.backups=0 org.apache.hadoop.mapred.YarnChild 192.168.101.29 33929 attempt_1510651062679_0001_r_000135_0 278 1>/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278/stdout 2>/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278/stderr
hadoop 124696 124692 74 12:33 ? 00:10:30 /opt/java/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx5800M -XX:-UsePerfData -Djava.io.tmpdir=/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1510651062679_0001/container_1510651062679_0001_01_000278/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Dyarn.app.mapreduce.shuffle.logger=INFO,shuffleCLA -Dyarn.app.mapreduce.shuffle.logfile=syslog.shuffle -Dyarn.app.mapreduce.shuffle.log.filesize=0 -Dyarn.app.mapreduce.shuffle.log.backups=0
The second process is a child of the first one. All in all, we see that the overall number of processes during our job execution is much higher than it was with Hadoop 1.0.3, where only one process was executing for each map or reduce task.
a) Could this be a reason for the job executing quite slower than it does with Hadoop 1.0.3 ?
b) Are those two processes the intended way it all works ?
Thank you in advance for your advice.
On a close check you will find
Pid 124692 is /bin/bash
Pid 124696 is /opt/java/bin/java
/bin/bash is the container process which spawns Java process within the enclosed environment (CPU , RAM restricted to a container)
You can think of this as A Virtual Machine inside which you can run your own process. Both virtual machine and the process running inside it will have a parent child relationship.
Please read about Linux container in details to know more about it .