Python+PySpark File locally connecting to a Remote HDFS/Spark/Yarn Cluster


I've been playing around with HDFS and Spark. I've set up a five-node cluster on my network running HDFS and Spark, managed by YARN. Workers are running in client mode. From the master node, I can launch the PySpark shell just fine. When I run the example jars, the job is split across the worker nodes and executes nicely.

I have a few questions on whether and how to run Python/PySpark files against this cluster.

  1. If I have a Python file with PySpark calls somewhere else, like on my local dev laptop or in a Docker container, is there a way to run or submit this file locally and have it executed on the remote Spark cluster? The methods I'm wondering about involve running spark-submit in the local/Docker environment while the file has SparkSession.builder.master() configured to point at the remote cluster.

  2. Related, I see a configuration for --master in spark-submit, but the only yarn option is to pass "yarn" which seems to only queue locally? Is there a way to specify remote yarn?

  3. If I can set up and run the file remotely, how do I set up SparkSession.builder.master()? Is the URL just the hdfs:// URL on port 9000, or do I submit it to one of the YARN ports?

TIA!


Solution

  • "way to run or submit this file locally and have it executed on the remote Spark cluster"

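    Yes, although strictly you are submitting to YARN, not to a "remote Spark cluster". Pass --master=yarn to spark-submit, and it will find the ResourceManager through the yarn-site.xml in the directory named by the HADOOP_CONF_DIR environment variable. On a laptop or in a container, that directory needs to hold a copy of the cluster's client-side Hadoop configuration files. You can define the variable at the OS level or in spark-env.sh.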

    You can also use SparkSession.builder.master('yarn') in code. If both are supplied, the value set in code takes precedence over the spark-submit flag.

    To run fully "in the cluster", also set --deploy-mode=cluster. The driver then runs inside YARN rather than on the machine that submitted the job.
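
A minimal sketch of the submission described above, written as a Python helper that assembles the spark-submit command and environment without running it. The script name my_job.py and the directory /etc/hadoop/conf are placeholders I've assumed for illustration; point HADOOP_CONF_DIR at wherever you copied the cluster's yarn-site.xml and core-site.xml.

```python
import os
import shlex


def build_spark_submit(app_path, conf_dir, deploy_mode="client"):
    """Return (argv, env) for submitting app_path to YARN.

    conf_dir is a local directory holding copies of the cluster's
    client-side Hadoop config files (a hypothetical path in this sketch).
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")

    env = dict(os.environ)
    # spark-submit locates the YARN ResourceManager through this directory.
    env["HADOOP_CONF_DIR"] = conf_dir

    argv = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        app_path,
    ]
    return argv, env


# Placeholder script name and config directory, assumed for illustration.
argv, env = build_spark_submit("my_job.py", "/etc/hadoop/conf", "cluster")
print(shlex.join(argv))
# To actually submit (requires Spark installed locally):
#   subprocess.run(argv, env=env, check=True)
```

Building the argument list separately from running it makes it easy to inspect the command before anything is sent to the cluster.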

    "Is there a way to specify remote yarn?"

    As mentioned, the ResourceManager location(s) come from yarn-site.xml. There is no URL form of --master for YARN; the value is always just "yarn".
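
To check which ResourceManager a local yarn-site.xml actually points at, something like the following sketch may help. The sample XML and the hostname master.example.internal are made up for illustration; the property names and the default port 8032 are the standard YARN ones. High-availability setups, which use per-ResourceManager property suffixes, are not handled here.

```python
import xml.etree.ElementTree as ET

# Hypothetical yarn-site.xml content; the hostname is a placeholder.
SAMPLE_YARN_SITE = """<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master.example.internal</value>
  </property>
</configuration>
"""


def hadoop_properties(xml_text):
    """Parse Hadoop-style <property> entries into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def resourcemanager_address(props):
    """Return host:port that clients would use to reach the ResourceManager."""
    explicit = props.get("yarn.resourcemanager.address")
    if explicit:
        return explicit
    # YARN's default address is <yarn.resourcemanager.hostname>:8032.
    host = props.get("yarn.resourcemanager.hostname", "0.0.0.0")
    return f"{host}:8032"


print(resourcemanager_address(hadoop_properties(SAMPLE_YARN_SITE)))
```

If this prints 0.0.0.0:8032 for your real file, the configuration does not name the remote ResourceManager, which would explain jobs appearing to queue only locally.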

    "how do I set up SparkSession.builder.master()? Is the url just to the hdfs:// url to port 9000"

    No - the YARN ResourceManager has its own RPC protocol and is not addressed with hdfs://. You can still read HDFS files with a reader method such as spark.read.text("hdfs://namenode:port/path"), though. As mentioned, .master('yarn') or --master yarn is the only Spark-specific setting you need.
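
Putting the two pieces together, here is a rough sketch that keeps the master as plain 'yarn' and uses hdfs:// only for the data path. The host name namenode, the file /data/input.txt, and the helper names are assumptions for illustration; port 9000 is the one mentioned in the question. count_lines needs PySpark installed and HADOOP_CONF_DIR set, so it is defined here but not called.

```python
def hdfs_uri(namenode, port, path):
    """Build an hdfs:// URI; a leading slash on path is normalized."""
    return f"hdfs://{namenode}:{port}/{path.lstrip('/')}"


def count_lines(uri, app_name="hdfs-line-count"):
    """Count the lines of a text file on HDFS using executors run by YARN."""
    # Imported lazily so hdfs_uri works even without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("yarn")  # ResourceManager is resolved via HADOOP_CONF_DIR
        .appName(app_name)
        .getOrCreate()
    )
    try:
        # spark.read is a DataFrameReader; .text() loads one row per line.
        return spark.read.text(uri).count()
    finally:
        spark.stop()


# Placeholder host and path, assumed for illustration.
uri = hdfs_uri("namenode", 9000, "/data/input.txt")
print(uri)
```

Calling count_lines(uri) from the laptop would start the driver locally (client mode) and run the executors on the cluster, provided the laptop can reach both the ResourceManager and the HDFS nodes.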


    If you want to use Docker containers, YARN does support this, but Spark's Kubernetes master will be easier to set up, and you can use Hadoop Ozone or MinIO rather than HDFS in Kubernetes.