python, amazon-web-services, apache-spark, amazon-emr

Spark EMR: Need to configure multiple spark-submits to run in parallel in an EMR cluster


I am new to AWS EMR and I'm facing a problem: I need multiple spark-submit jobs to run in parallel.

I have some jobs scheduled to run every 10 minutes and a job that runs every 6 hours. The cluster has enough resources to run them all at the same time, but the default config puts them all into the single root.default queue, which makes them run sequentially. This is not what I want. What do I have to write in the configuration files?

I've tried adding queues "1", "2" and "3" to the root queue (in yarn-site.xml) and submitting each job into a separate queue, but they still run sequentially instead of in parallel.

spark-submit --queue 1 --num-executors 1  s3://bucket/some-job.py

spark-submit --queue 2 --num-executors 1  s3://bucket/some-job.py

Solution

  • I've found an unexpected solution.

    It seems the current Hadoop documentation does not reflect the correct YARN configuration for AWS EMR, so I used trial and error to find a setup that works.

    Instead of the Capacity Scheduler I used the Fair Scheduler. By default it still places all apps in the same queue ("pool"), so I had to manually submit each job into a separate queue and configure those queues to take an appropriate share of resources. This is what I've done:

    yarn-site.xml

    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>

    <property>
      <name>yarn.scheduler.fair.allocation.file</name>
      <value>fair-scheduler.xml</value>
    </property>

    <property>
      <name>yarn.scheduler.fair.preemption</name>
      <value>true</value>
    </property>
    

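    To make these settings take effect, YARN's ResourceManager has to pick them up. The sketch below is one way to do that by hand on the master node; the /etc/hadoop/conf path and the hadoop-yarn-resourcemanager service name are assumptions for a recent EMR (Amazon Linux 2) release, and the yarn-site properties can alternatively be set through EMR's yarn-site configuration classification at cluster creation.

    # Sketch (assumed paths/service names for a recent EMR release):
    # drop the allocation file into the Hadoop conf dir, where YARN resolves
    # yarn.scheduler.fair.allocation.file from its classpath
    sudo cp fair-scheduler.xml /etc/hadoop/conf/fair-scheduler.xml
    # add the three properties above to the existing yarn-site.xml (don't replace the file)
    sudo vi /etc/hadoop/conf/yarn-site.xml
    # restart the ResourceManager so it loads the Fair Scheduler and the new allocations
    sudo systemctl restart hadoop-yarn-resourcemanager
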
    The purpose of the Fair Scheduler is to schedule tasks and distribute resources fairly. But (surprise!) it does not do that unless preemption is enabled: the first application grabs all the resources and does not give them back until it finishes, unless you explicitly ask for them.

    This is how I configured preemption:

    fair-scheduler.xml

    <allocations>
      <pool name="smalltask">
        <schedulingMode>FAIR</schedulingMode>
        <maxRunningApps>4</maxRunningApps>
        <weight>1</weight>
        <fairSharePreemptionThreshold>0.4</fairSharePreemptionThreshold>
        <fairSharePreemptionTimeout>1</fairSharePreemptionTimeout>
      </pool>
    
      <pool name="bigtask">
        <schedulingMode>FAIR</schedulingMode>
        <maxRunningApps>2</maxRunningApps>
        <fairSharePreemptionThreshold>0.6</fairSharePreemptionThreshold>
        <fairSharePreemptionTimeout>1</fairSharePreemptionTimeout>
        <weight>2</weight>
      </pool>
    </allocations>
    

    Now I have two queues: the small queue can run 4 small tasks and the big queue can run 2 big tasks at the same time. The big queue has a higher weight, so it is entitled to a larger share of the cluster. If the small queue has been receiving less than 40% of its fair share for longer than the preemption timeout, it starts "nationalizing" containers from the other queue to get its share back; the big queue does the same at 60%. What happens inside each queue I don't know for sure, but it seems resources are distributed roughly equally between the apps in it.
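
    For completeness, this is roughly how the jobs are then submitted into the two pools; the --queue value just has to match a pool name from fair-scheduler.xml, and the S3 paths are placeholders like in the question:

    # frequent, light jobs go to the "smalltask" pool
    spark-submit --queue smalltask --num-executors 1 s3://bucket/some-job.py
    # the 6-hourly job goes to the "bigtask" pool
    spark-submit --queue bigtask s3://bucket/some-job.py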

    My happy New Year wish would be detailed documentation for Hadoop and EMR.