apache-spark, hadoop2

What is the difference between a standalone Hadoop installation and the Hadoop included with Spark?


I am a newbie in the Hadoop and Spark domain. For a tutorial, I want to add some data to Hadoop and query it from Spark. So I installed standalone Hadoop by following this, and I downloaded the Spark build that does not include Hadoop. But I got an error like this. I tried setting the classpath to the Hadoop folder that I installed. The classpath was like this:

SPARK_DIST_CLASSPATH=%HADOOP_HOME%\share\hadoop\tools\lib\* 

Apart from this, I tracked through the Spark sources and found the reference to the SPARK_DIST_CLASSPATH environment variable. I still got the error, so in the end I installed the Spark build that includes Hadoop. I am curious whether there are other constraints I am missing.


Solution

  • There is no real difference between a standalone Hadoop installation and the Hadoop bundled with Spark. To use Spark, you need the Hadoop API, at least for I/O. The error you are reporting is:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

    It usually means the classpath is not set correctly.

    The missing class is packaged in a jar, something like hadoop-common-{version}.jar. On my machine, that class is located in:

    ${HADOOP_HOME}/share/hadoop/common/hadoop-common-2.7.3.jar
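
    If you want to verify which jar actually ships that class on your own installation, a quick shell check works (a sketch, assuming the standard tarball layout under $HADOOP_HOME):

    # print every jar under the Hadoop install that contains FSDataInputStream
    for jar in $(find "$HADOOP_HOME/share/hadoop" -name '*.jar'); do
      if unzip -l "$jar" 2>/dev/null | grep -q 'org/apache/hadoop/fs/FSDataInputStream.class'; then
        echo "$jar"
      fi
    done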
    

    Please check that all the Hadoop environment variables are set correctly. The most important is HADOOP_HOME; as you can see below, on Linux (and most likely on Windows) all the other variables depend on it.

    When you use the startup scripts, they set even more environment variables that depend on HADOOP_HOME:

    export HADOOP_MAPRED_HOME=$HADOOP_HOME 
    export HADOOP_COMMON_HOME=$HADOOP_HOME 
    export HADOOP_HDFS_HOME=$HADOOP_HOME 
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native 
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
    export HADOOP_ROOT_LOGGER=INFO,console
    export HADOOP_SECURITY_LOGGER=INFO,NullAppender
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_PREFIX=$HADOOP_HOME
    export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
    export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
    export HADOOP_YARN_HOME=$HADOOP_HOME
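
    Once these are in place (for example in ~/.bashrc or in etc/hadoop/hadoop-env.sh), a few quick checks confirm they are picked up; the expected outputs are assumptions for a 2.7.3 single-node setup:

    echo "$HADOOP_HOME"   # should print the installation directory
    hadoop version        # should report the release, e.g. Hadoop 2.7.3
    hdfs dfs -ls /        # should list the HDFS root once the daemons are up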
    

    To find out what the Hadoop classpath is, run:

    $ hadoop classpath 
    /opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/contrib/capacity-scheduler/*.jar
    

    and SPARK_DIST_CLASSPATH has to be set to this value. Read this. Setting the value by hand is not a good idea.
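
    A minimal sketch of how that usually looks on Linux, assuming Spark's conf/spark-env.sh (adjust the file name and location to your layout):

    # conf/spark-env.sh -- let the hadoop launcher compute the classpath at startup
    # (assumes the `hadoop` command is on PATH; otherwise give its full path)
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)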

    Most likely you are using the wrong paths. Make sure everything in Hadoop is working before firing up Spark.
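
    For the tutorial goal (put data into Hadoop, query it from Spark), a rough end-to-end check could look like the following; the HDFS port and paths are assumptions for a default single-node setup:

    jps                                     # NameNode and DataNode should be listed
    echo "hello hadoop" > /tmp/sample.txt
    hdfs dfs -mkdir -p /user/demo
    hdfs dfs -put /tmp/sample.txt /user/demo/
    hdfs dfs -cat /user/demo/sample.txt     # prints "hello hadoop" if HDFS works
    # only then start Spark and read the same file, e.g. in spark-shell:
    #   sc.textFile("hdfs://localhost:9000/user/demo/sample.txt").count()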