I am trying to run the spark-submit command on my Hadoop cluster.
Here is a summary of my Hadoop Cluster:
When I run the following command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I receive the following error:
java.io.FileNotFoundException: File file:/home/bitnami/sparkStaging/bitnami/.sparkStaging/application_1658417340986_0002/__spark_conf__.zip does not exist
I also get a similar error when trying to create a SparkSession using PySpark:
spark = SparkSession.builder.appName('appName').getOrCreate()
I have tried/verified the following:

- HADOOP_HOME, SPARK_HOME, and HADOOP_CONF_DIR have been set in my .bashrc file
- SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh
- spark.master yarn, spark.yarn.stagingDir file:///home/bitnami/sparkStaging, and spark.yarn.jars file:///opt/bitnami/hadoop/spark/jars/ have been set in spark-defaults.conf
I believe spark.yarn.stagingDir needs to be an HDFS path.
More specifically, the "YARN staging directory" needs to be available on all Spark executors, not just as a local file path on the machine where you run spark-submit. The path that isn't found is being reported from the YARN cluster, where /home/bitnami might not exist, or where the Unix user running the Spark executor containers does not have access to that path.
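For example, a minimal sketch of the relevant spark-defaults.conf entry, assuming fs.defaultFS in core-site.xml points at your cluster's NameNode and that your YARN user can write to /user/bitnami in HDFS (the exact directory is an assumption; any writable HDFS path works):

# assumed path; replace with any HDFS directory your YARN user can write to
spark.yarn.stagingDir    hdfs:///user/bitnami/sparkStaging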
Similarly, spark.yarn.jars (or spark.yarn.archive) should be HDFS paths, because these will get downloaded, in parallel, across all executors.
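A sketch of how that could look, assuming an HDFS directory /spark/jars (the directory name is an assumption; any HDFS path readable by your YARN user works). First upload the Spark jars once:

# create an HDFS directory and copy the local Spark jars into it
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put $SPARK_HOME/jars/*.jar /spark/jars/

Then point spark-defaults.conf at it (globs are allowed for spark.yarn.jars):

# assumed HDFS location from the upload step above
spark.yarn.jars    hdfs:///spark/jars/*.jar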