I have been trying to spark-submit to my Cloudera cluster for a few weeks, and I really hope someone out there knows how this works.
I created a script that calls spark-submit with all the required arguments. It prints the following:
Using properties file: null
Using properties file: null
Parsed arguments:
master yarn
deployMode cluster
executorMemory null
executorCores null
totalExecutorCores null
propertiesFile null
driverMemory null
driverCores null
driverExtraClassPath /home/bruce/workspace1/spark-cloudera/yarn/stable/target/spark-yarn_2.10-1.0.0-cdh5.1.0.jar:/home/bruce/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.3.0-cdh5.1.0/hadoop-yarn-client-2.3.0-cdh5.1.0.jar:/home/bruce/.m2/repository/org/apache/hadoop/hadoop-common/2.3.0-cdh5.1.0/hadoop-common-2.3.0-cdh5.1.0.jar:/home/bruce/.m2/repository/org/apache/hadoop/hadoop-yarn-api/2.3.0-cdh5.1.0/hadoop-yarn-api-2.3.0-cdh5.1.0.jar:/home/bruce/.m2/repository/org/apache/hadoop/hadoop-yarn-common/2.3.0-cdh5.1.0/hadoop-yarn-common-2.3.0-cdh5.1.0.jar:/home/bruce/.m2/repository/org/apache/hadoop/hadoop-auth/2.3.0-cdh5.1.0/hadoop-auth-2.3.0-cdh5.1.0.jar:/home/bruce/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar
driverExtraLibraryPath null
driverExtraJavaOptions null
supervise false
queue null
numExecutors null
files null
pyFiles null
archives null
mainClass org.apache.spark.examples.SparkPi
primaryResource file:/home/bruce/workspace1/spark-cloudera/examples/target/scala-2.10/spark-examples-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar
name org.apache.spark.examples.SparkPi
childArgs [10]
jars null
verbose true
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
The call hangs for a very long time and then quits with Connection refused.
What I don't understand is that the arguments specify YARN in cluster mode, but nowhere do they indicate how to contact the YARN ResourceManager: neither the IP nor the port. The submission is made from my laptop, and the cluster is on the neighboring subnet. How does spark-submit figure out how to contact the YARN service?
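For what it's worth, the ResourceManager address normally comes from yarn-site.xml in the directory that HADOOP_CONF_DIR (or YARN_CONF_DIR) points to; if that setting is missing, the YARN client falls back to the default 0.0.0.0:8032, i.e. it tries the local machine, which would match the Connection refused symptom. A minimal client-side sketch (the host name below is a placeholder, not a real address):

```xml
<!-- yarn-site.xml on the submitting machine; host is hypothetical -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>rm-host.example.com:8032</value>
</property>
```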
From the Spark documentation:
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to the dfs and connect to the YARN ResourceManager.
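Following that advice, the fix is usually to export HADOOP_CONF_DIR before submitting. A sketch, assuming the Cloudera client configs have been copied to /etc/hadoop/conf on the laptop (the path is an assumption; reusing the class and jar from the dump above):

```shell
# Point spark-submit at the cluster's client-side configs (hypothetical path);
# it reads yarn-site.xml from here to locate the ResourceManager.
export HADOOP_CONF_DIR=/etc/hadoop/conf

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /home/bruce/workspace1/spark-cloudera/examples/target/scala-2.10/spark-examples-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar \
  10
```

Without this variable, the client has no way to discover the ResourceManager and falls back to a local default, so the submission from a machine outside the cluster fails.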