Tags: hadoop, hadoop-yarn, google-cloud-dataproc

Running submitted jobs sequentially in Google Cloud Dataproc


I created a Google Dataproc cluster with 2 workers, using n1-standard-4 VMs for both the master and the workers.

I want to submit jobs to a given cluster and have them run sequentially (like on AWS EMR), i.e., if the first job is in the running state, the next submitted job should go to the pending state, and only after the first job completes should the second job start running.

I tried submitting jobs to the cluster, but they all ran in parallel - no job went to the pending state.

Is there any configuration I can set on the Dataproc cluster so that all jobs run sequentially?

I updated the following files:

/etc/hadoop/conf/yarn-site.xml

   <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
   </property>
   <property>
      <name>yarn.scheduler.fair.user-as-default-queue</name>
      <value>false</value>
   </property>
   <property>
      <name>yarn.scheduler.fair.allocation.file</name>
      <value>/etc/hadoop/conf/fair-scheduler.xml</value>
   </property>

/etc/hadoop/conf/fair-scheduler.xml

<?xml version="1.0" encoding="UTF-8"?>
<allocations>
   <queueMaxAppsDefault>1</queueMaxAppsDefault>
</allocations>

After making the above changes on the master node, I restarted the service with systemctl restart hadoop-yarn-resourcemanager. But jobs still run in parallel.


Solution

  • Dataproc tries to execute submitted jobs in parallel if resources are available.

    To achieve sequential execution you may want to use an orchestration solution, either Dataproc Workflows or Cloud Composer (see the workflow-template sketch after this list).

    Alternatively, you may want to configure the YARN Fair Scheduler on Dataproc and set the queueMaxAppsDefault property to 1, as shown in the second sketch below.
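
For the Dataproc Workflows option, here is a minimal sketch using gcloud workflow templates against the existing cluster. The template name, region, cluster name, bucket, and job classes are illustrative placeholders, and it assumes Spark jobs packaged in a jar on Cloud Storage; chaining steps with --start-after forces them to run one after another:

    # Create a workflow template (name and region are placeholders).
    gcloud dataproc workflow-templates create sequential-jobs \
        --region=us-central1

    # Target the existing cluster via its automatically applied
    # goog-dataproc-cluster-name label ("my-cluster" is a placeholder).
    gcloud dataproc workflow-templates set-cluster-selector sequential-jobs \
        --region=us-central1 \
        --cluster-labels=goog-dataproc-cluster-name=my-cluster

    # First step (class and jar are placeholders).
    gcloud dataproc workflow-templates add-job spark \
        --workflow-template=sequential-jobs \
        --region=us-central1 \
        --step-id=job-1 \
        --class=com.example.FirstJob \
        --jars=gs://my-bucket/jobs.jar

    # Second step only starts after job-1 finishes.
    gcloud dataproc workflow-templates add-job spark \
        --workflow-template=sequential-jobs \
        --region=us-central1 \
        --step-id=job-2 \
        --start-after=job-1 \
        --class=com.example.SecondJob \
        --jars=gs://my-bucket/jobs.jar

    # Run the workflow; steps execute in dependency order.
    gcloud dataproc workflow-templates instantiate sequential-jobs \
        --region=us-central1

Each instantiation runs job-1 to completion before job-2 starts, which gives the EMR-style sequential behaviour without touching YARN settings.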
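
For the Fair Scheduler option, instead of hand-editing /etc/hadoop/conf/yarn-site.xml and restarting services, the same settings can be applied at cluster creation time through Dataproc cluster properties (the yarn: prefix maps to yarn-site.xml). This is a sketch with an illustrative cluster name, region, and initialization-action script; the fair-scheduler.xml containing queueMaxAppsDefault=1 still has to be placed on the master, e.g. by that (hypothetical) init action:

    # Sketch: bake the Fair Scheduler settings into the cluster at creation time.
    # Cluster name, region, and the init-action path are placeholders; the init
    # action is assumed to write the fair-scheduler.xml shown above.
    gcloud dataproc clusters create my-sequential-cluster \
        --region=us-central1 \
        --num-workers=2 \
        --master-machine-type=n1-standard-4 \
        --worker-machine-type=n1-standard-4 \
        --initialization-actions=gs://my-bucket/write-fair-scheduler-xml.sh \
        --properties='yarn:yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler,yarn:yarn.scheduler.fair.user-as-default-queue=false,yarn:yarn.scheduler.fair.allocation.file=/etc/hadoop/conf/fair-scheduler.xml'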