Tags: python, pyspark, hadoop-yarn, amazon-emr

Confusion using Yarn Resource Manager


I am trying to run a simple PySpark job on Amazon AWS (EMR), and Spark is configured to use YARN via the spark-defaults.conf file. I am slightly confused about the YARN deployment code.

I have seen example code like the following:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster('yarn-client')  # pre-Spark-2.0 style: YARN master with the driver in client mode
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)

And I am not sure how I should submit the Spark job when 'yarn-client' is specified like this. I usually do it as follows:

$ spark-submit --deploy-mode client spark-job.py

But what is the difference between

$ spark-submit --deploy-mode client spark-job.py

and

$ spark-submit spark-job.py

And how can I tell from the Spark logs whether a job ran in client mode, cluster mode, or yarn-client mode?


Solution

  • The default --deploy-mode is client, so both of the spark-submit commands below run in client mode.

    $ spark-submit --deploy-mode client spark-job.py
    

    and

    $ spark-submit spark-job.py
    

    If you also specify --master yarn, the job will run on YARN in client mode (see the example commands after this list). Without --master, spark-submit uses the spark.master value from spark-defaults.conf, or local[*] if none is set; on EMR that file typically sets the master to YARN already, which is why the short command above still reaches YARN.

    Note: --master is the master URL for the cluster (e.g. spark://23.195.26.187:7077 for a standalone cluster). Supported cluster managers:

      • standalone
      • YARN
      • Mesos
      • Kubernetes

    --deploy-mode: whether to launch the driver on the worker nodes (cluster) or locally as an external client (client); default: client. Possible values:

      • client
      • cluster
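
  • For example, the following commands show the common combinations. spark-job.py is just the script name from the question; the cluster-mode line is included for contrast:

    $ spark-submit --master yarn --deploy-mode client spark-job.py   # driver runs on the submitting machine
    $ spark-submit --master yarn spark-job.py                        # same thing: --deploy-mode defaults to client
    $ spark-submit --master yarn --deploy-mode cluster spark-job.py  # driver runs inside a YARN container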
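
  • As for telling the modes apart: rather than digging through logs, one option is to print the effective settings from inside the job itself. A minimal sketch, assuming Spark 1.5+ (where spark-submit populates the spark.submit.deployMode property):

    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName('spark-yarn'))

    # Effective master URL, e.g. 'yarn' or 'local[*]'
    print("master:", sc.master)

    # 'client' or 'cluster'; set by spark-submit
    print("deploy mode:", sc.getConf().get("spark.submit.deployMode", "client"))

    Where that output lands is itself a clue: in client mode it appears in your terminal, while in cluster mode it goes to the YARN application logs (yarn logs -applicationId <appId>).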