apache-spark hadoop-yarn

How to use --num-executors option with spark-submit?


I am trying to override Spark properties such as num-executors while submitting the application with spark-submit, as below:

spark-submit --class WC.WordCount \
--num-executors 8 \
--executor-cores 5 \
--executor-memory 3584M \
...../<myjar>.jar \
/public/blahblahblah /user/blahblah

However, it runs with the default number of executors, which is 2. But I am able to override the properties if I add

--master yarn

Can someone explain why this is so? Interestingly, in my application code I am setting the master to yarn-client:

val conf = new SparkConf()
   .setAppName("wordcount")
   .setMaster("yarn-client")
   .set("spark.ui.port","56487")

val sc = new SparkContext(conf)

Can someone shed some light on how the --master option works?


Solution

  • I am trying to override Spark properties such as num-executors while submitting the application with spark-submit, as below

    It will not work (unless you override spark.master in the conf/spark-defaults.conf file, or similar, so you don't have to specify it explicitly on the command line).
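
    For example, a minimal conf/spark-defaults.conf sketch that makes YARN the default; the executor values below simply mirror the flags from the question, and spark.executor.instances is the property behind --num-executors:

    # conf/spark-defaults.conf -- a minimal sketch, assuming a YARN cluster
    # is already reachable via HADOOP_CONF_DIR / YARN_CONF_DIR
    spark.master                yarn
    spark.executor.instances    8
    spark.executor.cores        5
    spark.executor.memory       3584m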

    The reason is that the default Spark master is local[*], where there is exactly one executor, namely the driver itself. That's just the local deployment environment. See Master URLs.
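
    If you want to confirm which master (and how many executors) your application actually ended up with, a quick sketch is to print them from inside the running application; sc here is the SparkContext created in the question's code:

    // A minimal check of the effective deployment environment (sketch only).
    println(s"master = ${sc.master}")
    // spark.executor.instances is the property behind --num-executors;
    // "not set" is just a fallback label for this illustration.
    println(s"executors = ${sc.getConf.get("spark.executor.instances", "not set")}")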

    As a matter of fact, --num-executors is YARN-specific, as you can see in the help:

    $ ./bin/spark-submit --help
    ...
     YARN-only:
      --num-executors NUM         Number of executors to launch (Default: 2).
                                  If dynamic allocation is enabled, the initial number of
                                  executors will be at least NUM.
    

    That explains why it worked when you switched to YARN. The option is supposed to work with YARN regardless of the deploy mode (client or cluster), which only determines where the driver runs, not the executors.
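
    In other words, keeping the original command from the question and only adding --master yarn should be enough (the jar path is left elided exactly as in the question):

    spark-submit --class WC.WordCount \
    --master yarn \
    --num-executors 8 \
    --executor-cores 5 \
    --executor-memory 3584M \
    ...../<myjar>.jar \
    /public/blahblahblah /user/blahblah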

    You may be wondering why it did not work with the master defined in your code, then. The reason is that it is too late: the master had already been assigned at launch time, when you started the application using spark-submit. That's exactly why you should not specify deployment-environment-specific properties in the code (see the sketch after the following list):

    1. It may not always work (see the case with master)
    2. It requires recompiling the code for every configuration change (which makes it a bit unwieldy)
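
    Here is a sketch of a more portable version of the code from the question, with setMaster removed so that the deployment environment comes entirely from spark-submit (e.g. --master yarn):

    import org.apache.spark.{SparkConf, SparkContext}

    // Deployment-specific settings (master, number of executors, memory)
    // are left to spark-submit; only application-level settings stay in code.
    val conf = new SparkConf()
      .setAppName("wordcount")
      .set("spark.ui.port", "56487")

    val sc = new SparkContext(conf)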

    That's why you should always use spark-submit to submit your Spark applications (unless you've got reasons not to, but then you'd know why and could explain it with ease).