Tags: hadoop, apache-spark, pyspark, emr, amazon-emr

Running a Spark job with spark-submit across the whole cluster


I have recently set up a Spark cluster on Amazon EMR with 1 master and 2 slaves.

I can run pyspark and submit jobs with spark-submit.

However, when I write a standalone job such as job.py, I create a SparkContext like so:

sc = SparkContext("local", "App Name")

This doesn't seem right, but I'm not sure what to put there.

When I submit the job, I am sure it is not utilizing the whole cluster.

If I want to run a job against my entire cluster, say with 4 processes per slave, what do I have to:

a) pass as arguments to spark-submit, and

b) pass as arguments to SparkContext() in the script itself?


Solution

  • You can create the SparkContext without hardcoding a master, so that spark-submit decides where the job runs:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("App Name")
    sc = SparkContext(conf=conf)
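
    For a complete script, a minimal job.py along these lines should work; the app name and the parallelize call are just placeholders to confirm the job runs, not something from the original post:

    from pyspark import SparkConf, SparkContext

    # No master is set here; spark-submit supplies it via --master,
    # so the same script can run locally or on the cluster unchanged.
    conf = SparkConf().setAppName("App Name")
    sc = SparkContext(conf=conf)

    # Trivial distributed computation spread over several partitions.
    rdd = sc.parallelize(range(1000), numSlices=8)
    print(rdd.map(lambda x: x * x).sum())

    sc.stop()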
    

    Then submit the program with spark-submit. For a Spark standalone cluster:

    ./bin/spark-submit --master spark://<sparkMasterIP>:7077 code.py
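
    If the standalone cluster is not using all of its cores by default, the core counts can also be capped explicitly; the numbers below are only illustrative:

    ./bin/spark-submit --master spark://<sparkMasterIP>:7077 \
      --executor-cores 4 --total-executor-cores 8 \
      code.py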
    

    For a Mesos cluster:

    ./bin/spark-submit --master mesos://<mesosMasterIP>:7077 code.py
    

    For a YARN cluster (which is what Amazon EMR uses):

    ./bin/spark-submit --master yarn --deploy-mode cluster code.py
    

    With a YARN master, the cluster location is read from the Hadoop client configuration pointed to by HADOOP_CONF_DIR (or YARN_CONF_DIR).
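
    Since the question asks for roughly 4 processes per slave, a submission along the following lines would request that much parallelism from YARN; the executor and core counts are illustrative assumptions, and recent EMR releases may already size executors for you via dynamic allocation:

    ./bin/spark-submit --master yarn --deploy-mode cluster \
      --num-executors 2 --executor-cores 4 --executor-memory 4G \
      code.py

    Nothing extra needs to be passed to SparkContext() inside the script; the conf-only constructor shown above picks these settings up from spark-submit.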