apache-spark, hadoop2

What is the difference between a standalone Hadoop installation and the Hadoop included with Spark?


I am a newbie in the Hadoop and Spark domain. For a tutorial, I want to add some data to Hadoop and query it from Spark. So I installed standalone Hadoop by following this, and I downloaded the Spark build that does not include Hadoop. But I got an error like this. I tried setting the classpath to the Hadoop folder that I installed. The classpath was like this:

SPARK_DIST_CLASSPATH=%HADOOP_HOME%\share\hadoop\tools\lib\* 

Apart from this, I tracked through the Spark sources and found the reference to the SPARK_DIST_CLASSPATH environment variable. I still got the error, so in the end I installed the Spark build that includes Hadoop. I am curious whether there are other constraints I am missing.


Solution

  • There is no real difference between a standalone Hadoop installation and the Hadoop bundled with Spark. To use Spark, you need the Hadoop API, at least for I/O. The error you are reporting is:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

    It usually means the classpath is not set correctly.

    The missing class is packaged in a jar, something like hadoop-common-{version}.jar. On my machine, that class is located in:

    ${HADOOP_HOME}/share/hadoop/common/hadoop-common-2.7.3.jar
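
    If you want to verify which jar actually ships that class on your own installation, a quick shell check works (a sketch, assuming the standard tarball layout under $HADOOP_HOME):

    # print every jar under the Hadoop install that contains FSDataInputStream
    for jar in $(find "$HADOOP_HOME/share/hadoop" -name '*.jar'); do
      if unzip -l "$jar" 2>/dev/null | grep -q 'org/apache/hadoop/fs/FSDataInputStream.class'; then
        echo "$jar"
      fi
    done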
    

    Please check that all the Hadoop environment variables are set correctly. The most important is HADOOP_HOME; as you can see below, on Linux (and most likely on Windows) all the other variables depend on it.

    When you use the startup scripts, they set even more environment variables that depend on HADOOP_HOME:

    export HADOOP_MAPRED_HOME=$HADOOP_HOME 
    export HADOOP_COMMON_HOME=$HADOOP_HOME 
    export HADOOP_HDFS_HOME=$HADOOP_HOME 
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native 
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
    export HADOOP_ROOT_LOGGER=INFO,console
    export HADOOP_SECURITY_LOGGER=INFO,NullAppender
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_PREFIX=$HADOOP_HOME
    export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
    export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
    export HADOOP_YARN_HOME=$HADOOP_HOME
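
    Once these are in place (for example in ~/.bashrc or in etc/hadoop/hadoop-env.sh), a few quick checks confirm they are picked up; the expected outputs are assumptions for a 2.7.3 single-node setup:

    echo "$HADOOP_HOME"   # should print the installation directory
    hadoop version        # should report the release, e.g. Hadoop 2.7.3
    hdfs dfs -ls /        # should list the HDFS root once the daemons are up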
    

    To find out what the Hadoop classpath is, run:

    $ hadoop classpath 
    /opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/contrib/capacity-scheduler/*.jar
    

    and SPARK_DIST_CLASSPATH has to be set to this value. Read this. Setting the value by hand is not a good idea.
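
    A minimal sketch of how that usually looks on Linux, assuming Spark's conf/spark-env.sh (adjust the file name and location to your layout):

    # conf/spark-env.sh -- let the hadoop launcher compute the classpath at startup
    # (assumes the `hadoop` command is on PATH; otherwise give its full path)
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)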

    Most likely you are using the wrong paths. Make sure everything in Hadoop is working before firing up Spark.
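
    For the tutorial goal (put data into Hadoop, query it from Spark), a rough end-to-end check could look like the following; the HDFS port and paths are assumptions for a default single-node setup:

    jps                                     # NameNode and DataNode should be listed
    echo "hello hadoop" > /tmp/sample.txt
    hdfs dfs -mkdir -p /user/demo
    hdfs dfs -put /tmp/sample.txt /user/demo/
    hdfs dfs -cat /user/demo/sample.txt     # prints "hello hadoop" if HDFS works
    # only then start Spark and read the same file, e.g. in spark-shell:
    #   sc.textFile("hdfs://localhost:9000/user/demo/sample.txt").count()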