
FileNotFound error when running spark-submit


I am trying to run the spark-submit command on my Hadoop cluster. Here is a summary of the cluster:

  • The cluster is built using 5 VirtualBox VMs connected on an internal network
  • There are 1 namenode and 4 datanodes
  • All the VMs were built from the Bitnami Hadoop Stack VirtualBox image

When I run the following command:

spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10

I receive the following error:

java.io.FileNotFoundException: File file:/home/bitnami/sparkStaging/bitnami/.sparkStaging/application_1658417340986_0002/__spark_conf__.zip does not exist

I also get a similar error when trying to create a SparkSession using PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('appName').getOrCreate()

I have tried/verified the following:

  • The environment variables HADOOP_HOME, SPARK_HOME, and HADOOP_CONF_DIR have been set in my .bashrc file
  • SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh
  • Added spark.master yarn, spark.yarn.stagingDir file:///home/bitnami/sparkStaging, and spark.yarn.jars file:///opt/bitnami/hadoop/spark/jars/ to spark-defaults.conf (shown below)
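
For reference, the relevant entries in spark-defaults.conf, reconstructed from the list above:

spark.master          yarn
spark.yarn.stagingDir file:///home/bitnami/sparkStaging
spark.yarn.jars       file:///opt/bitnami/hadoop/spark/jars/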

Solution

  • I believe spark.yarn.stagingDir needs to be an HDFS path.

    More specifically, the "YARN staging directory" needs to be available on all Spark executors, not just as a local file path on the machine from which you run spark-submit.

    The path that isn't found is being reported by the YARN cluster, where /home/bitnami might not exist, or where the Unix user running the Spark executor containers does not have access to that path.

    Similarly, spark.yarn.jars (or spark.yarn.archive) should be an HDFS path, because the jars will be downloaded, in parallel, by all executors.
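
    A minimal sketch of the fix, assuming HDFS is the default filesystem and using /user/bitnami/sparkStaging and /spark/jars as example HDFS paths (both paths are assumptions; adjust to your cluster):

# Create an HDFS staging directory and upload Spark's jars (example paths)
hdfs dfs -mkdir -p /user/bitnami/sparkStaging
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put $SPARK_HOME/jars/*.jar /spark/jars/

    Then point spark-defaults.conf at the HDFS locations instead of file:// paths:

spark.master          yarn
spark.yarn.stagingDir hdfs:///user/bitnami/sparkStaging
spark.yarn.jars       hdfs:///spark/jars/*.jar

    With these set, the staging files and jars live in HDFS, so every executor container can fetch them regardless of which node it runs on.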