Tags: hadoop, mapreduce, hadoop-yarn, mrv2

Hadoop / Yarn (v0.23.3) Pseudo-Distributed Mode setup :: No job node


I just set up Hadoop/Yarn 2.x (specifically, v0.23.3) in Pseudo-Distributed mode.

I followed the instructions of a few blogs & websites which, more or less, provide the same prescription for setting it up. I also followed the 3rd edition of O'Reilly's Hadoop book (which, ironically, was the least helpful).

THE PROBLEM:

After running "start-dfs.sh" and then "start-yarn.sh", while all of the daemons
do start (as indicated by jps(1)), the Resource Manager web portal
(Here: http://localhost:8088/cluster/nodes) indicates 0 (zero) job-nodes in the
cluster. So while the example/test Hadoop job does get submitted and
scheduled, it pends forever because, I assume, the configuration doesn't see a
node to run it on.

Below are the steps I performed, including resultant configuration files.
Hopefully the community can help me out... (and thank you in advance).

THE CONFIGURATION:

The following environment variables are set in both my and hadoop's UNIX account profiles: ~/.profile:

export HADOOP_HOME=/home/myself/APPS.d/APACHE_HADOOP.d/latest
  # Note: /home/myself/APPS.d/APACHE_HADOOP.d/latest -> hadoop-0.23.3

export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_INSTALL=${HADOOP_HOME}
export HADOOP_CLASSPATH=${HADOOP_HOME}/lib
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/conf
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop/conf
export JAVA_HOME=/usr/lib/jvm/jre

hadoop$ java -version

java version "1.7.0_06-icedtea"
OpenJDK Runtime Environment (fedora-2.3.1.fc17.2-x86_64)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)

# Although the above shows OpenJDK, the same problem happens with Sun's JRE/JDK.

The NAMENODE & DATANODE directories, also specified in etc/hadoop/conf/hdfs-site.xml:

/home/myself/APPS.d/APACHE_HADOOP.d/latest/YARN_DATA.d/HDFS.d/DATANODE.d/
/home/myself/APPS.d/APACHE_HADOOP.d/latest/YARN_DATA.d/HDFS.d/NAMENODE.d/

Next, the various XML configuration files (again, YARN/MRv2/v0.23.3 here):

hadoop$ pwd; ls -l
/home/myself/APPS.d/APACHE_HADOOP.d/latest/etc/hadoop/conf
lrwxrwxrwx 1 hadoop hadoop   16 Sep 20 13:14 core-site.xml -> ../core-site.xml
lrwxrwxrwx 1 hadoop hadoop   16 Sep 20 13:14 hdfs-site.xml -> ../hdfs-site.xml
lrwxrwxrwx 1 hadoop hadoop   18 Sep 20 13:14 httpfs-site.xml -> ../httpfs-site.xml
lrwxrwxrwx 1 hadoop hadoop   18 Sep 20 13:14 mapred-site.xml -> ../mapred-site.xml
-rw-rw-r-- 1 hadoop hadoop   10 Sep 20 15:36 slaves
lrwxrwxrwx 1 hadoop hadoop   16 Sep 20 13:14 yarn-site.xml -> ../yarn-site.xml

core-site.xml

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>

  <!-- Same problem whether this (legacy) stanza is included or not.  -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

hdfs-site.xml

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/HDFS.d/NAMENODE.d</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/HDFS.d/DATANODE.d</value>
  </property>
</configuration>

yarn-site.xml

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/TEMP.d</value>
  </property>
</configuration>

etc/hadoop/conf/slaves

localhost
   # Community/friends, is this entry correct/needed for my pseudo-dist mode?

Miscellaneous wrap-up notes:

(1) As you may have gleaned from above, all files/directories are owned
    by the 'hadoop' UNIX user (there is a dedicated hadoop:hadoop UNIX
    user and group).

(2) The following command was run after the NAMENODE & DATANODE directories
    (listed above) were created (and whose paths were entered into
    hdfs-site.xml):

    hadoop$ hadoop namenode -format

(3) Next, I ran "start-dfs.sh", then "start-yarn.sh".
    Here is jps(1) output:

hadoop@e6510$ jps
    21979 DataNode
    22253 ResourceManager
    22384 NodeManager
    22156 SecondaryNameNode
    21829 NameNode
    22742 Jps

Thank you!


Solution

  • After much toil on this problem without success (and trust me, I tried it all), I got Hadoop working using a different approach. Whereas above I downloaded a gzip/tar ball of the Hadoop distribution (again v0.23.3) from one of the download mirrors, this time I used the Cloudera CDH distribution of RPM packages, which I installed via their YUM repos. In the hope that this helps someone, here are the detailed steps.

    Step-1:

    For Hadoop 0.20.x (MapReduce version 1):

      # rpm -Uvh http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm
      # rpm --import http://archive.cloudera.com/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
      # yum install hadoop-0.20-conf-pseudo
    

    -or-

    For Hadoop 0.23.x (MapReduce version 2):

      # rpm -Uvh http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.noarch.rpm
      # rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
      # yum install hadoop-conf-pseudo
    

    In both cases above, installing that "pseudo" package (short for "pseudo-distributed Hadoop" mode) alone will conveniently trigger the installation of all the other necessary packages you'll need (via dependency resolution).
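    One way to sanity-check that dependency pull is to list what actually landed. This is a hedged sketch: the package names below are assumptions based on CDH4's naming (CDH3 uses hadoop-0.20-* names instead), so adjust them to whatever `rpm -qa` really reports on your box.

    ```shell
    # List everything the "pseudo" meta-package dragged in.
    rpm -qa | grep -i hadoop | sort > /tmp/hadoop-pkgs.txt

    # Spot-check the daemons a pseudo-distributed node needs.
    # (Package names are assumed, not confirmed against a live repo.)
    for pkg in hadoop-hdfs-namenode hadoop-hdfs-datanode \
               hadoop-yarn-resourcemanager hadoop-yarn-nodemanager
    do
      grep -q "^${pkg}" /tmp/hadoop-pkgs.txt \
        && echo "OK:      ${pkg}" \
        || echo "MISSING: ${pkg}"
    done
    ```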

    Step-2:

    Install Sun/Oracle's Java JRE (if you haven't already done so). You can install it via the RPM that they provide, or via the gzip/tar ball portable version. It doesn't matter which, as long as you set and export the "JAVA_HOME" environment variable appropriately and ensure ${JAVA_HOME}/bin/java is in your PATH.

      # echo $JAVA_HOME; which java
      /home/myself/APPS.d/JAVA-JRE.d/jdk1.7.0_07
      /home/myself/APPS.d/JAVA-JRE.d/jdk1.7.0_07/bin/java
    

    Note: I actually create a symlink called "latest" and point/re-point it at the version-specific Java directory whenever I update Java. I was explicit above for the reader's understanding.
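    That symlink scheme can be sketched as follows. The directory layout mirrors the paths shown above, but treat it as an illustrative assumption, not a prescription; the key detail is `ln -sfn`, which re-points an existing link atomically instead of descending into the old target.

    ```shell
    # Assumed layout: version-specific JDK directories under one parent,
    # with a "latest" symlink that JAVA_HOME points at.
    JAVA_BASE="$HOME/APPS.d/JAVA-JRE.d"
    mkdir -p "${JAVA_BASE}/jdk1.7.0_07"

    # -s: symbolic; -f: replace an existing link; -n: treat an existing
    # symlink-to-directory as a plain file so it is replaced, not entered.
    ln -sfn "${JAVA_BASE}/jdk1.7.0_07" "${JAVA_BASE}/latest"

    export JAVA_HOME="${JAVA_BASE}/latest"
    export PATH="${JAVA_HOME}/bin:${PATH}"
    ```

    On a Java upgrade, only the `ln -sfn` target changes; every profile that references "latest" keeps working unmodified.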

    Step-3: Format hdfs as the "hdfs" Unix user (created during "yum install" above).

      # sudo su hdfs -c "hadoop namenode -format"
    

    Step-4:

    Manually start the hadoop daemons.

      for file in /etc/init.d/hadoop*
      do
         "${file}" start
      done
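
    To confirm the daemons actually came up (not just that the scripts ran), the same glob can query each script's "status" action. This assumes the CDH init scripts support a status argument, as System-V style init scripts conventionally do.

    ```shell
    # Ask each Hadoop init script whether its daemon is running.
    for file in /etc/init.d/hadoop*
    do
       "${file}" status
    done
    ```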
    

    Step-5:

    Check to see if things are working. The following is for MapReduce v1 (it's not much different for MapReduce v2 at this superficial level).

      root# jps
       23104 DataNode
       23469 TaskTracker
       23361 SecondaryNameNode
       23187 JobTracker
       23267 NameNode
       24754 Jps
    
       # Do the next commands as yourself (not as "root").
       myself$ hadoop fs -mkdir /foo
       myself$ hadoop fs -rmr /foo
       myself$ hadoop jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u5-examples.jar pi 2 100000
    

    I hope this helped!