Tags: hadoop, apache-spark, pyspark, emr, amazon-emr

Running a Spark job with spark-submit across the whole cluster


I have recently set up a Spark cluster on Amazon EMR with 1 master and 2 slaves.

I can run pyspark and submit jobs with spark-submit.

However, when I write a standalone job such as job.py, I create a SparkContext like so:

sc = SparkContext("local", "App Name")

This doesn't seem right, but I'm not sure what to put there.

When I submit the job, I am sure it is not utilizing the whole cluster.

If I want to run a job against my entire cluster, say with 4 processes per slave, what do I have to:

a) pass as arguments to spark-submit, and

b) pass as arguments to SparkContext() in the script itself?


Solution

  • You can create the SparkContext without hardcoding a master, so that spark-submit decides where the job runs:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("App Name")
    sc = SparkContext(conf=conf)
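
    For a complete script, a minimal job.py along these lines should work; the app name and the parallelize call are just placeholders to confirm the job runs, not something from the original post:

    from pyspark import SparkConf, SparkContext

    # No master is set here; spark-submit supplies it via --master,
    # so the same script can run locally or on the cluster unchanged.
    conf = SparkConf().setAppName("App Name")
    sc = SparkContext(conf=conf)

    # Trivial distributed computation spread over several partitions.
    rdd = sc.parallelize(range(1000), numSlices=8)
    print(rdd.map(lambda x: x * x).sum())

    sc.stop()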
    

    Then submit the program with spark-submit. For a Spark standalone cluster:

    ./bin/spark-submit --master spark://<sparkMasterIP>:7077 code.py
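
    If the standalone cluster is not using all of its cores by default, the core counts can also be capped explicitly; the numbers below are only illustrative:

    ./bin/spark-submit --master spark://<sparkMasterIP>:7077 \
      --executor-cores 4 --total-executor-cores 8 \
      code.py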
    

    For a Mesos cluster:

    ./bin/spark-submit --master mesos://<mesosMasterIP>:7077 code.py
    

    For a YARN cluster (which is what Amazon EMR uses):

    ./bin/spark-submit --master yarn --deploy-mode cluster code.py
    

    With a YARN master, the cluster location is read from the Hadoop client configuration pointed to by HADOOP_CONF_DIR (or YARN_CONF_DIR).
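
    Since the question asks for roughly 4 processes per slave, a submission along the following lines would request that much parallelism from YARN; the executor and core counts are illustrative assumptions, and recent EMR releases may already size executors for you via dynamic allocation:

    ./bin/spark-submit --master yarn --deploy-mode cluster \
      --num-executors 2 --executor-cores 4 --executor-memory 4G \
      code.py

    Nothing extra needs to be passed to SparkContext() inside the script; the conf-only constructor shown above picks these settings up from spark-submit.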