amazon-web-services apache-spark hadoop-yarn amazon-emr

How does EMR command-runner submit jobs?


How can I find out exactly what happens after an EMR step is deployed to the cluster when the master is set to local[x]?

How does command-runner.jar submit a job to the EMR master? If I pass "--executor-cores 4" as a spark-submit argument, but create the session in the Launcher with local[8], how many cores will each executor get? How many executors will be created?

I could not find this in the AWS documentation. Example:

SomeStep: 
  Type: AWS::EMR::Step
  Properties: 
    ActionOnFailure: CONTINUE
    HadoopJarStep: 
      Args: 
        - "spark-submit"
        - "--deploy-mode"
        - "cluster"
        - "--executor-cores"
        - "4"
        - "--class"
        - "com.psyquation.batch.analytic.Driver"
        - !Sub "s3://some-bucker/my-app.jar"
      Jar: command-runner.jar
      MainClass: com.somepackage.Launcher
    Name: SomeStep
    JobFlowId: !Ref SomeCluster

And now inside com.somepackage.Launcher:

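import org.apache.spark.sql.SparkSession;

// getOrCreate() returns a SparkSession, not a SparkSession.Builder
SparkSession spark = SparkSession.builder()
    // some configs ...
    .master("local[8]")
    .getOrCreate();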

Solution

  • If you set the master to “local”, no separate executors are created at all. The entire application (driver and tasks) runs in a single JVM process on the master instance. Using “local[8]” only lets this single process run up to 8 tasks in parallel, in different threads, so “--executor-cores 4” has no executors to apply to.

    A master URL of “local” is mainly intended for local development and testing, not for running on a cluster. You most likely should not set the master URL in code at all, so that the application uses “yarn”, which is the default on EMR.
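
    To make the thread counts above concrete, here is a minimal plain-Java sketch (no Spark dependency) of how a master URL maps to the number of parallel task threads. LocalMaster and taskThreads are hypothetical names used only for illustration, not Spark APIs, and the sketch assumes only the forms “local”, “local[N]” and “local[*]” need handling; anything else, such as “yarn”, is treated as a cluster master.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; this is not part of Spark.
class LocalMaster {
    // Matches "local[8]" or "local[*]"; other local variants are deliberately not handled.
    private static final Pattern LOCAL_N = Pattern.compile("local\\[(\\d+|\\*)\\]");

    /** Task threads for a local master URL, or empty for a cluster master such as "yarn". */
    static OptionalInt taskThreads(String master) {
        if (master.equals("local")) {
            return OptionalInt.of(1); // plain "local": one task at a time
        }
        Matcher m = LOCAL_N.matcher(master);
        if (m.matches()) {
            String n = m.group(1);
            return OptionalInt.of(n.equals("*")
                    ? Runtime.getRuntime().availableProcessors() // "local[*]": one thread per logical core
                    : Integer.parseInt(n));                      // "local[N]": N threads
        }
        // Not a local master: executors are allocated by the cluster manager instead.
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        for (String master : new String[] {"local", "local[8]", "local[*]", "yarn"}) {
            System.out.println(master + " -> " + taskThreads(master));
        }
    }
}
```

    In this sketch “local[8]” yields 8 threads regardless of any “--executor-cores” value, which mirrors the point above: in local mode the thread count comes from the master URL, while with “yarn” the executor settings passed to spark-submit are the ones that matter.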