Tags: hadoop, apache-spark, docker, hadoop-yarn, applicationmanager

Application manager in YARN setup


I have a setup of 1 NameNode, 2 DataNodes, 1 ResourceManager and 2 NodeManagers, all running as Docker containers. Every time I execute a spark-submit (YARN cluster mode) from 2 machines (2 clients), the jobs complete sequentially: Job1 and Job2 both go into the ACCEPTED state, Job1 moves to RUNNING and then FINISHED, and only then is Job2 picked up and executed. Is there any way to get these jobs to run in parallel? How does the ApplicationsManager pick these tasks to hand to the NodeManagers?


Solution

  • The cluster is using the YARN Capacity Scheduler, which is the default in most Hadoop distributions. When multiple jobs are submitted by the same user, they enter the same queue, and jobs within a queue are scheduled FIFO. This is the default behaviour of the Capacity Scheduler.

    The Fair Scheduler can instead be configured to run jobs in parallel by sharing the available resources between them.

    Add this property to yarn-site.xml:

    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
    

    Configure the Fair Scheduler queues in an allocation file (a sample file is sketched below):

    <property>
      <name>yarn.scheduler.fair.allocation.file</name>
      <value>/path/to/allocation-file.xml</value>
    </property>
    

    If this property is not configured, the Fair Scheduler will create a queue per user by default.
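
    For reference, a minimal allocation file might look like the following sketch. The queue names and weights are illustrative assumptions, not part of the original setup; with two equally weighted queues, jobs submitted from the two clients can run side by side instead of waiting on each other.

    <!-- Illustrative allocation-file.xml (queue names and weights are assumptions):
         two equally weighted queues that share the cluster's resources fairly. -->
    <allocations>
      <queue name="client1">
        <weight>1.0</weight>
        <schedulingPolicy>fair</schedulingPolicy>
      </queue>
      <queue name="client2">
        <weight>1.0</weight>
        <schedulingPolicy>fair</schedulingPolicy>
      </queue>
    </allocations>

    Each client can then target its queue with spark-submit's --queue flag (for example, --queue client1). If no allocation file is given, the default queue-per-user behaviour described above will usually give the same sharing, provided the two clients submit as different users.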