apache-spark hadoop hadoop-yarn

What is the difference between submitting a Spark job with spark-submit and running it through hadoop directly?


I have noticed that there are two ways of running Spark jobs in my project.

  1. The first way is submitting the job with the spark-submit script:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master local[8] \
      /path/to/examples.jar \
      100

  2. The second way is to package the Java code into a jar and run it via hadoop, with the Spark code inside MainClassName:

    hadoop jar JarFile.jar MainClassName

What is the difference between these two ways? What prerequisites do I need in order to use either?


Solution

  • The second way, packaging Java code that uses Spark classes into a jar and launching it with hadoop jar, essentially wraps your Spark job in a Hadoop launch: hadoop jar runs the main method of the class with Hadoop's classpath and configuration, so the Spark libraries have to be available to it as well (for example, bundled into the jar). This has its disadvantages. Mainly, the job becomes directly dependent on the Java and Scala versions installed on your system or cluster, and you can also run into compatibility problems between the versions of the different frameworks. The developer therefore has to be careful about the setup of two platforms instead of one. It may still seem a bit simpler to Hadoop users who have a better grasp of Java and the Map/Reduce/Driver layout than of Spark's more pre-tuned nature and the convenient but somewhat steep learning curve of Scala.

    The first way is the more "standard" one (judging by most of the usage that can be seen online, so take this with a grain of salt). The job runs almost entirely within Spark, except where it reads its input from or writes its output to HDFS. In the command shown, with --master local[8], you depend only on Spark, and Hadoop's resource management (YARN) stays out of the picture; YARN becomes involved only if you choose it as the master. It can also be faster, since it is the more direct approach, though that depends on the job and the setup.
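
To make the practical difference concrete, here is a minimal sketch in plain Java that only assembles the two command lines from the question; it does not run anything. The class name SubmitCommands and the helpers sparkSubmit and hadoopJar are hypothetical names introduced for illustration and are not part of Spark or Hadoop. The sketch shows that with spark-submit the master is just another argument, whereas the hadoop jar form has no slot for it, so the master would have to be set in the code or configuration of MainClassName.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class: it only builds argument lists, nothing is launched.
public class SubmitCommands {

    // First way: spark-submit takes the main class and the master as options,
    // followed by the application jar and the application's own arguments.
    public static List<String> sparkSubmit(String master, String mainClass,
                                           String jar, String... appArgs) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "./bin/spark-submit",
                "--class", mainClass,
                "--master", master,
                jar));
        cmd.addAll(Arrays.asList(appArgs));
        return cmd;
    }

    // Second way: hadoop jar takes only the jar and the main class, so the
    // Spark master cannot be chosen here (assumption: it is set inside the class).
    public static List<String> hadoopJar(String jar, String mainClass) {
        return new ArrayList<>(Arrays.asList("hadoop", "jar", jar, mainClass));
    }

    public static void main(String[] args) {
        String mainClass = "org.apache.spark.examples.SparkPi";
        String jar = "/path/to/examples.jar";

        // Same jar, same class: only the value after --master changes.
        System.out.println(String.join(" ", sparkSubmit("local[8]", mainClass, jar, "100")));
        System.out.println(String.join(" ", sparkSubmit("yarn", mainClass, jar, "100")));

        System.out.println(String.join(" ", hadoopJar("JarFile.jar", "MainClassName")));
    }
}
```

If you did want to launch one of these from Java, the returned list could be handed to java.lang.ProcessBuilder, assuming the spark-submit script or the hadoop command is actually available at that path on the machine.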