google-cloud-dataproc

How to queue new jobs when running Spark on DataProc


How can I submit several jobs to Google Dataproc (PySpark) and queue the jobs that do not fit on the current executors?

Simply submitting the jobs does not queue them; here is the output for any subsequent job:

 $ gcloud dataproc jobs submit pyspark myjob.py
 ...
 WARN  Utils:70 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041

YARN should accept a "queue" parameter for this purpose, but I cannot find any documentation about using it with Dataproc.


Solution

  • In your case you probably want to just ignore that warning; it's harmless, and your driver is indeed getting queued up properly on that same cluster. When multiple drivers run on the same host (the Dataproc master), the Spark UI simply binds to successive ports starting at 4040.

    Note that this does not mean a later submission actively waits for an earlier one to finish; submitted jobs run as concurrently as resources allow. In Dataproc, if you submit, say, 100 jobs, you should see something like 10 of them (the exact number varies with machine and cluster sizes) immediately get handed to YARN, and several (or all) of those will get enough YARN containers to begin running while the others remain PENDING in YARN. As jobs complete, Dataproc incrementally submits the remaining 90 to YARN as resources become available.
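
    For example, one way to watch this queueing behavior (a rough sketch; the cluster name my-cluster is only a placeholder, and myjob.py is the script from the question):

    # Submit a batch of jobs without waiting for each one to finish:
    for i in $(seq 1 20); do
      gcloud dataproc jobs submit pyspark myjob.py --cluster=my-cluster --async
    done

    # Dataproc's view of jobs that are running or still waiting for resources:
    gcloud dataproc jobs list --cluster=my-cluster --state-filter=active

    # YARN's view (run on the master node); jobs beyond current capacity show as ACCEPTED:
    yarn application -list -appStates ACCEPTED,RUNNING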

    At the moment there's no first-class support for YARN queues in Dataproc, but you can customize your YARN queues at cluster-creation time using:

    gcloud dataproc clusters create --properties \
        '^;^yarn:yarn.scheduler.capacity.root.queues=foo,bar,default;spark:other.config=baz'
    

    (replacing the gcloud delimiter with ; so the comma-separated queue list can pass through) and/or additional yarn-site.xml configs as outlined in YARN CapacityScheduler tutorials, and then you specify the queue with:

    gcloud dataproc jobs submit spark --properties spark.yarn.queue=foo
    

    though that won't change the port 4040 warning you're seeing. That's because Dataproc defaults to yarn-client mode for Spark, meaning the driver program runs on the master node, and driver submission is not subject to YARN queuing.
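
    As a side note on the cluster-creation command above: for the extra queues to actually accept applications, the CapacityScheduler generally also needs a capacity assigned to each queue (the children of root must sum to 100). A rough sketch of building up those properties, where the cluster name and the percentages are only illustrative:

    # Build the ;-delimited --properties string (the leading ^;^ switches the gcloud delimiter):
    PROPS='^;^yarn:yarn.scheduler.capacity.root.queues=foo,bar,default'
    PROPS="${PROPS};yarn:yarn.scheduler.capacity.root.foo.capacity=30"
    PROPS="${PROPS};yarn:yarn.scheduler.capacity.root.bar.capacity=30"
    PROPS="${PROPS};yarn:yarn.scheduler.capacity.root.default.capacity=40"

    gcloud dataproc clusters create my-cluster --properties "${PROPS}"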

    You can use yarn-cluster mode as follows:

    gcloud dataproc jobs submit spark --properties \
        spark.master=yarn-cluster,spark.yarn.queue=foo
    

    Then it'll use the foo yarn-queue if you've defined it, as well as using yarn-cluster mode so that the driver program runs in a YARN container. In this case you'd no longer hit any port 4040 warnings, but in yarn-cluster mode you also don't get to see the stdout/stderr of your driver program in the Dataproc UI anymore.
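
    If you do go the yarn-cluster route, the driver's stdout/stderr can still be retrieved from YARN's aggregated logs, assuming log aggregation is enabled on the cluster. A rough sketch, run on the master node (the application ID below is just a placeholder):

    # Find the application ID of your job:
    yarn application -list -appStates RUNNING,FINISHED

    # Fetch the aggregated logs for it, which include the driver's stdout/stderr:
    yarn logs -applicationId application_1234567890123_0001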