Tags: scala, apache-spark, hadoop, hadoop-yarn, amazon-emr

Submitting Multiple Jobs in Sequence


I'm having some trouble understanding how Spark allows for scheduling of jobs. I have a series of jobs I'd like to run in sequence. From what I've read, I can launch any number of jobs via spark-submit and the cluster manager will schedule them automatically based on available resources, but I want to guarantee that the jobs run in order, each waiting for the previous job to complete. I understand that I can write a script that just submits the jobs one after another, but I'm wondering if Spark has a built-in mechanism to handle these kinds of submissions.
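To be concrete, the scripted approach I mean would look roughly like this (just a sketch; the class names and jar paths are placeholders for my real jobs):

```python
import subprocess
import sys

# Hypothetical jobs: (main class, jar) pairs -- placeholders for real jobs.
jobs = [
    ("com.example.JobA", "job-a.jar"),
    ("com.example.JobB", "job-b.jar"),
    ("com.example.JobC", "job-c.jar"),
]

for main_class, jar in jobs:
    # In YARN client mode spark-submit blocks until the application
    # finishes (cluster mode also blocks while the default
    # spark.yarn.submit.waitAppCompletion=true holds), so each
    # iteration waits for the previous job to complete.
    result = subprocess.run(
        ["spark-submit", "--master", "yarn",
         "--class", main_class, jar]
    )
    if result.returncode != 0:
        sys.exit(f"{main_class} failed; aborting the rest of the sequence")
```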

What's more, I have several of these series of jobs. Supposing I have one series A -> B -> C and another D -> E -> F, I'd be fine with any of A, B, or C running concurrently with any of D, E, or F, but not with any of A, B, or C running concurrently with another job from its own series (B must wait for A, and C for B). Does Spark have a built-in mechanism to handle this use case?
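To make that concrete too, the behaviour I want is roughly the following (again a sketch with placeholder names): each series runs sequentially inside its own thread, while the threads themselves are free to overlap:

```python
import subprocess
import threading

def run_series(jobs):
    """Run one series of Spark jobs strictly in order; stop on failure."""
    for main_class, jar in jobs:
        result = subprocess.run(
            ["spark-submit", "--master", "yarn",
             "--class", main_class, jar]
        )
        if result.returncode != 0:
            print(f"{main_class} failed; abandoning this series")
            return

# Placeholder job definitions for the two independent series.
series_abc = [("com.example.JobA", "a.jar"),
              ("com.example.JobB", "b.jar"),
              ("com.example.JobC", "c.jar")]
series_def = [("com.example.JobD", "d.jar"),
              ("com.example.JobE", "e.jar"),
              ("com.example.JobF", "f.jar")]

# Each thread keeps its own series sequential; the two threads
# (and therefore the two series) may overlap freely.
threads = [threading.Thread(target=run_series, args=(s,))
           for s in (series_abc, series_def)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```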

I've read a little about YARN's queueing mechanism and its support for multiple queues, but I'm not sure whether that is the solution I'm looking for.

Thanks!


Solution

  • YARN's role is to distribute resources among your jobs, not to order them.

    If you submit all of your jobs at the same time, they will start in an order determined by the resources each job requests, the queue priority, the queue strategy (FIFO or fair), and so on. None of that guarantees that one job waits for another to complete.

    You could create three different queues with different priorities (jobs are directed to a queue with spark-submit's --queue option) and submit everything at once, but relying on priorities to force an ordering seems pretty fragile.

    What you are really looking for is a workflow scheduler such as Apache Airflow or Apache Oozie, which lets you declare the dependency A -> B -> C explicitly while leaving independent series free to run concurrently.
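With Airflow, for example, each series becomes a DAG whose tasks are chained, so B cannot start before A has succeeded, while separate DAGs overlap freely. A minimal sketch using the BashOperator to shell out to spark-submit (Airflow 2.4+ style; the dag_id, jar paths, and class names are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# One DAG per series gives "sequential within a series, concurrent
# across series" for free; dag_id, jars, and classes are placeholders.
with DAG(
    dag_id="series_abc",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # trigger manually, or supply a cron expression
    catchup=False,
) as dag:
    job_a = BashOperator(
        task_id="job_a",
        bash_command="spark-submit --class com.example.JobA a.jar",
    )
    job_b = BashOperator(
        task_id="job_b",
        bash_command="spark-submit --class com.example.JobB b.jar",
    )
    job_c = BashOperator(
        task_id="job_c",
        bash_command="spark-submit --class com.example.JobC c.jar",
    )

    # Each downstream task starts only after its upstream task succeeds.
    job_a >> job_b >> job_c
```

A second DAG defined the same way for D -> E -> F would run independently, which is exactly the "sequential within a series, concurrent across series" behaviour you described.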