apache-spark, hadoop-yarn

Scheduler delay time in Spark and YARN


I'm doing some instrumentation in Spark and I've realised that some of my tasks take a really long time to complete because of the Scheduler Delay time, which can be extracted from the TaskMetrics. I know there are already some questions about this topic, like "What is scheduler delay in spark UI's event timeline", but the answers have not been accepted, and one of them says that a task waiting for an open slot counts as scheduler delay, which I think is not true (as far as I know, a task that doesn't have a slot in an executor doesn't start generating metrics).

I'm a bit confused about where this delay really starts. I was wondering if it also takes into account the period between the app being accepted by the YARN client and the first job of the app being submitted. In other words, the period between this moment, where the app is accepted:

[screenshot: the application in the accepted state]

and this one, where it is running:

[screenshots: the application in the running state]


Solution

  • I checked directly by launching an app with few resources available in the cluster. It stayed in the queue until enough executors could be launched for the stage. Then the yarn.Client launched the stage in the cluster. The metrics in Spark don't count this time in the queue as any delay. Having more tasks than cores, the case discussed in the Stack Overflow answer I linked above, doesn't matter either: the tasks are allocated to the executors as they become available.

    In short, scheduler delay time only covers sending the task to the executor. If there is a delay here, the bottleneck is not YARN but the load on the nodes involved (normally the driver and the worker nodes hosting the app's executors).

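A rough sketch of why queue time never shows up in this metric: as far as I understand it, the Spark UI derives scheduler delay from the task's own timings, roughly as the task's duration (measured from the moment the task is launched) minus the time the executor reports for deserializing, running, and serializing the result, and minus the time the driver spends fetching the result. Anything that happens before the task is launched, such as the app waiting in the YARN queue, is outside that window. The function name, parameter names, and numbers below are made up for illustration; this is not Spark's API.

```python
def scheduler_delay_ms(duration_ms, executor_run_ms, deserialize_ms,
                       result_serialize_ms, getting_result_ms=0):
    """Part of a task's duration that no reported executor/driver phase accounts for.

    All inputs are hypothetical per-task timings in milliseconds.
    duration_ms is assumed to run from task launch to task finish.
    """
    accounted = (executor_run_ms + deserialize_ms
                 + result_serialize_ms + getting_result_ms)
    # Clamp at zero: timing noise can make the reported parts exceed the duration.
    return max(0, duration_ms - accounted)


# Made-up timings for a single task.
task = {
    "duration_ms": 1200,
    "executor_run_ms": 900,
    "deserialize_ms": 50,
    "result_serialize_ms": 10,
    "getting_result_ms": 0,
}
print(scheduler_delay_ms(**task))  # 240
```

With these numbers, 240 ms of the task's 1200 ms are unaccounted for and would be shown as scheduler delay. The time the app spent waiting to be accepted and started by YARN is not part of `duration_ms` at all, so it cannot appear here.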