apache-spark, hadoop-yarn, amazon-emr

How does a MasterNode fit into a Spark cluster?


I'm getting a little confused about how to set up my Spark configuration for workloads using YARN as the resource manager. I've got a small cluster spun up right now with 1 master node and 2 core nodes.

Do I include the master node when calculating the number of executors, or not?

Do I leave out 1 core on every node to account for YARN management?

Am I supposed to designate the master node for anything in particular in Spark configurations?


Solution

    1. The master node shouldn't be taken into account when calculating the number of executors (a sizing sketch follows this list).
    2. Each node is actually an EC2 instance running a full operating system, so you have to leave 1 or more cores free on every node for system tasks and YARN agents such as the NodeManager.
    3. The master node can be used to run the Spark driver. To do this, submit the job from the master node in client mode by adding --master yarn --deploy-mode client to the spark-submit command. Keep in mind the following:

      Cluster mode allows you to submit work using S3 URIs. Client mode requires that you put the application in the local file system on the cluster master node

    To do all the preparation work (copying libs, scripts, etc. to the master node), you can set up a separate step, and then run the spark-submit --master yarn --deploy-mode client command as the next step; both steps are sketched below.
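
    As a rough illustration of points 1 and 2, here is a minimal sizing sketch. It assumes the 2 core nodes are m5.xlarge instances (4 vCPUs and 16 GiB of RAM each); the instance type, memory figures, and application name are assumptions for illustration, not taken from the question:

        # Assumed: 2 x m5.xlarge core nodes (4 vCPUs, 16 GiB RAM each).
        # Per node: 4 cores - 1 core reserved for the OS and the YARN NodeManager = 3 cores.
        # That gives one 3-core executor per core node; the master node adds nothing.
        # executor-memory stays below the per-node allotment to leave room for
        # spark.executor.memoryOverhead (about 10% by default) and system daemons.
        spark-submit \
          --master yarn \
          --deploy-mode client \
          --num-executors 2 \
          --executor-cores 3 \
          --executor-memory 10g \
          my_app.py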
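
    And a sketch of the two-step flow from the paragraph above, using the AWS CLI and EMR's command-runner.jar (a standard way to run arbitrary commands on the master node). The cluster ID, bucket, and file names are placeholders:

        # Step 1: copy the application from S3 to the master node's local file
        # system, since client mode needs the application to be a local file.
        aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
          'Type=CUSTOM_JAR,Name=CopyApp,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[aws,s3,cp,s3://my-bucket/my_app.py,/home/hadoop/]'

        # Step 2: run spark-submit in client mode so the driver runs on the master node.
        aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
          'Type=CUSTOM_JAR,Name=RunApp,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[spark-submit,--master,yarn,--deploy-mode,client,/home/hadoop/my_app.py]'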