
Spark Submit Issue


I am trying to run a fat jar on a Spark cluster using spark-submit. I created the cluster on AWS with the "spark-ec2" script that ships in the Spark bundle.

The command I am using to run the jar file is

bin/spark-submit --class edu.gatech.cse8803.main.Main --master yarn-cluster ../src1/big-data-hw2-assembly-1.0.jar

At first it failed with an error saying that at least one of the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables must be set. I didn't know what to set them to, so I ran the following command

export HADOOP_CONF_DIR=/mapreduce/conf

Now the error has changed to

Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
Run with --help for usage help or --verbose for debug output

The home directory structure is as follows

ephemeral-hdfs  hadoop-native  mapreduce  persistent-hdfs  scala  spark  spark-ec2  src1  tachyon

I even set the YARN_CONF_DIR variable to the same value as HADOOP_CONF_DIR, but the error message did not change. I am unable to find any documentation that addresses this issue; most of it just mentions these two variables and gives no further details.
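For context, the Spark documentation says that HADOOP_CONF_DIR or YARN_CONF_DIR should point to the directory holding the client-side Hadoop configuration files. A rough way to sanity-check a candidate directory is sketched below. The function name check_conf_dir is made up for illustration, and the two file names are only the conventional ones, so treat both as assumptions.

```shell
# Sketch: report whether a directory looks like a Hadoop/YARN client config dir.
# check_conf_dir is a hypothetical helper; the file names checked are the
# conventional ones and may differ on a given distribution.
check_conf_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir"
        return 1
    fi
    for f in core-site.xml yarn-site.xml; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            return 1
        fi
    done
    echo "ok: $dir"
}

# Example call, using the path from the export above:
# check_conf_dir /mapreduce/conf
```

Passing this check only means the variable points somewhere plausible. It does not by itself make the "Could not load YARN classes" error go away.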


Solution

  • You need to build Spark with YARN support before you can submit to a YARN master.

    Follow the steps explained here: https://spark.apache.org/docs/latest/building-spark.html

    Maven:

    build/mvn -Pyarn -Phadoop-2.x -Dhadoop.version=2.x.x -DskipTests clean package
    

    SBT:

    build/sbt -Pyarn -Phadoop-2.x assembly
    

    You can also download a pre-compiled version here: http://spark.apache.org/downloads.html (choose one of the "Pre-built for Hadoop" packages)
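To check whether a given Spark build includes YARN support, one option is to look for the YARN client class in the jar listing. This is a minimal sketch: the helper name listing_has_yarn_client is hypothetical, the class name org.apache.spark.deploy.yarn.Client is the one from Spark's YARN module, and the jar path in the usage comment assumes a Spark 1.x layout with an assembly jar under lib/.

```shell
# Sketch: read a jar listing (from `unzip -l` or `jar tf`) on stdin and
# succeed if Spark's YARN client class appears in it.
# listing_has_yarn_client is a hypothetical helper name.
listing_has_yarn_client() {
    grep -q 'org/apache/spark/deploy/yarn/Client\.class'
}

# Usage (the jar location is an assumption; adjust to your install):
# unzip -l "$SPARK_HOME"/lib/spark-assembly-*.jar | listing_has_yarn_client \
#     && echo "YARN classes found" \
#     || echo "no YARN classes; rebuild with -Pyarn or use a pre-built package"
```

If the class is missing, the build was most likely made without the -Pyarn profile, which matches the error message in the question.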