apache-spark, google-cloud-dataproc

Dataproc: dependency on cluster's HDFS when submitting Spark job to YARN from an edge node


I have a running Dataproc cluster, and I want to submit a Spark job directly to YARN with spark-submit from an edge node outside the cluster. Ideally, spark-submit should only need access to the YARN ResourceManager address, so we configured firewall rules to allow only that, but the job submission failed because it also needed access to the cluster's HDFS.

Questions:

  1. Why does spark-submit need to access HDFS?
  2. Is there a way to avoid that?

Solution

  • It has to do with the property spark.yarn.stagingDir. spark-submit uses this directory to stage the application's jars and config files so that YARN can access them and distribute them to the executors. The default value is the current user's home directory in HDFS, but it can be set to a GCS directory to avoid HDFS, for example:

    spark-submit --conf spark.yarn.stagingDir=gs://my-bucket/spark-staging/
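
    For a fuller picture, a submission from the edge node might look roughly like the sketch below; the bucket name, application jar, and main class are placeholders for illustration:

    # Submit from the edge node to the cluster's YARN ResourceManager,
    # staging job files in GCS instead of the cluster's HDFS.
    # (hypothetical bucket, jar, and class names)
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.stagingDir=gs://my-bucket/spark-staging/ \
      --class com.example.MyApp \
      my-app.jar

    Note that spark-submit on the edge node must then be able to write to that GCS path (e.g. have the GCS connector on its classpath and suitable credentials) instead of writing to HDFS.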