I am trying to run a simple PySpark job on AWS, configured to use YARN via the spark-defaults.conf file. I am slightly confused about the YARN deployment code.
I see some example code as below:
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setMaster('yarn-client')  # run on YARN, with the driver in this client process
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)
I am not sure how I should execute the Spark job in this case, when 'yarn-client' is specified. I usually do it as follows:
$ spark-submit --deploy-mode client spark-job.py
But what is the difference between
$ spark-submit --deploy-mode client spark-job.py
and
$ spark-submit spark-job.py
And how do I tell, by looking at the Spark logs, whether a job ran in client mode, cluster mode, or yarn-client?
The default --deploy-mode is client, so both of the spark-submit commands below will run in client mode (the master is taken from the command line if given, otherwise from spark-defaults.conf, otherwise local[*]).
$ spark-submit --deploy-mode client spark-job.py
and
$ spark-submit spark-job.py
If you specify --master yarn, it will run on YARN, still in client mode.
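For example, the two invocations below are equivalent (assuming spark-job.py is your application); to run the driver inside the cluster instead, pass --deploy-mode cluster:

$ spark-submit --master yarn spark-job.py
$ spark-submit --master yarn --deploy-mode client spark-job.py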
Note:

--master: the master URL for the cluster (e.g. spark://23.195.26.187:7077 for a standalone cluster). Supported cluster managers: standalone, YARN, Mesos, Kubernetes.

--deploy-mode: whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client). Default: client.
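As for telling which mode a job actually ran in: one way is to read the resolved configuration from inside the job itself. Below is a minimal sketch (the app name 'check-mode' is just a placeholder); spark.master and spark.submit.deployMode are standard Spark properties, and they also show up in the Environment tab of the Spark UI:

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName('check-mode'))

# Resolved master URL, e.g. 'yarn' or 'local[*]'
print(sc.master)

# Resolved deploy mode; falls back to 'client' if the property is unset
print(sc.getConf().get('spark.submit.deployMode', 'client'))

sc.stop()

In client mode the driver's output (including these prints) appears in your terminal; in cluster mode it is captured in the YARN container logs, which you can fetch with yarn logs -applicationId <appId>.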