Tags: cuda, gpu, nvidia, gpgpu

Time-sliced GPU scheduler


I saw this question.

The answer states

The scheduler is described as a "time-sliced" scheduler in the latest MPS doc and what appears to be happening is that rather than wait for a kernel from one process to complete, the scheduler may, according to some unpublished rules, choose to pre-empt a running kernel so that it can switch to another kernel from another process.
...
However, as described in the MPS doc, the code from kernel A is not executing in the same clock cycle(s) as the code from kernel B, when A and B originate from separate processes in the non-MPS case.

I tested a few machine learning programs (training deep models). Running a single process and running 3 identical processes in parallel (say, launched with bash) take almost the same amount of time. Moreover, the GPU-Util field in nvidia-smi seems to go up significantly in the multi-process case. The outputs of these processes appear in parallel.

How is this possible with time-slicing? Why is the time not (roughly) equal to 3 times the single process time?

Further, if only a single context runs at any given point in time, why does GPU-Util go up? And doesn't context-switching create additional overhead?

Using MPS does not seem to make any difference.


Solution

  • Context/Preamble:

    Deep learning training usually proceeds as a sequence of epochs, and in the GPU case, each epoch will have a sequence of kernels launched, associated with the work being done in that epoch.
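
    To make that concrete, here is a minimal sketch (not from the original answer) of what one "epoch" looks like from the GPU's point of view: a burst of kernel launches followed by host-side work (data loading, logging, checkpointing) during which this process leaves the GPU idle. The kernel name train_step, the launch sizes, and the 100 ms host gap are made-up illustration choices.

    // Hypothetical sketch: one "epoch" = a burst of kernel launches plus a host-side gap.
    #include <cuda_runtime.h>
    #include <unistd.h>

    __global__ void train_step(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;          // placeholder for the real training math
    }

    int main() {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        for (int epoch = 0; epoch < 4; ++epoch) {
            // Burst of kernel activity: the "X" on the timeline below.
            for (int step = 0; step < 100; ++step)
                train_step<<<(n + 255) / 256, 256>>>(d_data, n);
            cudaDeviceSynchronize();

            // Host-side gap: the GPU is idle for this process.
            usleep(100 * 1000);
        }
        cudaFree(d_data);
        return 0;
    }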

You've already mentioned that GPU utilization appears to be lower in the single-process case and higher in the multi-process case. So let's consider an example. Suppose the GPU utilization pattern looks like this:

    Epoch:     1   2   3   4   ...
    Activity:  X   X   X   X   ...
    

The "Activity" row represents kernel activity and is intended to be a timeline-like or profiler-like view of the work. Compared to a fully occupied activity timeline (XXXXXX...), this process uses only about 1/4 of the available timeline, so we'll assume a GPU utilization measurement reports a number around 25%.
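
    As a minimal sketch (not part of the original answer) of such a process, the program below spins on the GPU for roughly 250 ms per iteration and then idles on the host for roughly 750 ms; run alone, nvidia-smi should report GPU-Util somewhere in the vicinity of 25%. The busy/idle durations and the ~1 GHz clock assumption are arbitrary illustration choices; note also that GPU-Util measures the fraction of time any kernel was resident, not how much of the GPU that kernel filled.

    // Hypothetical sketch: GPU busy ~1/4 of the wall-clock timeline.
    #include <cuda_runtime.h>
    #include <unistd.h>

    __global__ void spin(long long cycles) {
        long long start = clock64();
        while (clock64() - start < cycles) { /* burn GPU time */ }
    }

    int main() {
        // ~250 ms of GPU work, assuming a ~1 GHz GPU clock (device-dependent).
        const long long busy_cycles = 250LL * 1000 * 1000;
        for (int i = 0; i < 60; ++i) {
            spin<<<1, 1>>>(busy_cycles);
            cudaDeviceSynchronize();      // wait out the ~250 ms of GPU activity
            usleep(750 * 1000);           // ~750 ms with no GPU work pending
        }
        return 0;
    }

    Launching three copies of this program side by side approximates the 3-process interleaving discussed next.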

Now suppose we have 3 of these processes. The GPU is a context-switching machine. Even if we ignore CUDA, the GPU is designed to interleave tasks like general graphics rendering, shader (program) processing, video processing, and other tasks. One of the mechanisms it uses to interleave the work is context-switching between these various tasks. This allows your desktop, a 3D graphical application window, and a video window to all update, all at the "same time". In a very simplistic definition, a GPU will context-switch to another task when the current task has no pending work and other tasks have pending work.

    With 3 of those machine learning tasks, replacing the X for each task with a number to distinguish (1 is activity from task 1, 2 is activity from task 2, etc.), even without considering time-slicing, the context-switching mechanism would allow the GPU to process the work as follows:

    Epoch:     1   2   3   4   ...
    Activity:  123 123 123 123 ...
    

    A few observations:

    • the apparent utilization has now increased from 25% to 75%
    • considering "wall clock time", all the work is getting done in approximately the same amount of time.

    Both of these observations are consistent with statements in your question. You probably already understood all this, but others reading your question might like some context.

    Questions:

    How is this possible with time-slicing?

Prior to time-slicing (prior to Pascal GPUs, I believe, based on my observations), the GPU used a "cooperative" form of context-switching. A detailed, highly accurate description is not necessary here. Instead, we can imagine, for example, that if 3 processes each launched kernels at the same time, the GPU would process the kernel from process 1 first, to completion, followed by the kernel from process 2, followed by the kernel from process 3. This is what I refer to as "cooperative" context-switching. The exact rules are unimportant; instead, the work is processed as I previously mentioned: when a given process's activity is idle and there is pending work in another process, the GPU will context-switch to that process.

This works pretty well except in cases where a single process launches work (e.g. a single kernel) that would run for a very long time, and the work offers no convenient context-switching point (e.g. a kernel boundary) until it is complete. To handle these situations, newer GPUs, instead of using "cooperative" context-switching, can use "time-sliced" context-switching. As an example of the difference, if the GPU is time-slicing and is currently running a kernel from process 1, it may, at some point, halt the processing of kernel 1 and context-switch to process 2 to begin processing kernel 2. It need not wait for a "convenient" point such as a kernel boundary.
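
    For illustration only (not from the original answer), here is the kind of workload that motivates time-sliced scheduling: a single kernel launch that spins for a long time and offers no kernel boundary at which a purely "cooperative" scheduler could switch away. Under cooperative context-switching, a kernel like this from process 1 would have to run to completion before a kernel from process 2 could start; a time-sliced scheduler may pre-empt it partway through. The cycle count and ~1 GHz clock assumption are arbitrary, and on a display GPU the watchdog may kill such a long kernel.

    // Hypothetical sketch: a long-running kernel with no convenient switch point.
    #include <cuda_runtime.h>

    __global__ void long_running(long long cycles) {
        long long start = clock64();
        while (clock64() - start < cycles) { /* no kernel boundary until done */ }
    }

    int main() {
        // Tens of seconds of uninterrupted GPU work, assuming a ~1 GHz GPU clock.
        long_running<<<1, 1>>>(30LL * 1000 * 1000 * 1000);
        cudaDeviceSynchronize();
        return 0;
    }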

However, time-slicing does not mean the GPU will context-switch to another process if that process has no work for the GPU to do. The same definition still applies: the GPU will context-switch to another process when that process has work to do.

    So, combining these ideas, context switching with time-slicing does not imply that our view of the 3 DL training jobs need be any different. The same interleaving can occur, with the same results/observations.

Further, if only a single context runs at any given point in time, why does GPU-Util go up?

    GPU utilization is a measure of activity over a period of time. Hopefully this question is already answered with the timeline "pictures" above. Suppose for the sake of argument that the period of measurement corresponds to the time taken by a single epoch. In the single process case, we observe a utilization of 25%. In the 3 process case, we observe a utilization of 75%, because the work from 3 processes can interleave due to context-switching. The possibility of time-slicing doesn't really impact this treatment, to a first-order approximation.
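
    If it helps, the same figure that nvidia-smi prints can be read programmatically through NVML, which reports GPU utilization as the percentage of time over the recent sample period during which one or more kernels was executing, i.e. activity over a period of time rather than how "full" the GPU is. A minimal sketch (not part of the original answer), assuming the NVML header and library that ship with the CUDA toolkit/driver (link with -lnvidia-ml):

    // Hypothetical sketch: sample GPU utilization once per second via NVML.
    #include <nvml.h>
    #include <stdio.h>
    #include <unistd.h>

    int main() {
        nvmlInit();
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);

        for (int i = 0; i < 10; ++i) {
            nvmlUtilization_t util;
            nvmlDeviceGetUtilizationRates(dev, &util);
            printf("GPU-Util: %u%%  Memory-Util: %u%%\n", util.gpu, util.memory);
            sleep(1);
        }
        nvmlShutdown();
        return 0;
    }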

And doesn't context-switching create additional overhead?

Yes, context-switching involves overhead, whether time-slicing is used or not. However, the GPU is designed to context-switch rapidly, so that the aforementioned graphical workloads can proceed with apparent concurrency. For 3 processes, the context-switch overhead might be very low, perhaps on the order of a few milliseconds per epoch or less (this is really just an example, not a specific statement). If your epoch processing time is on the order of hundreds of milliseconds or longer, the context-switch overhead might not be significant.

Using MPS does not seem to make any difference.

To a first-order approximation (for descriptive purposes, not an exact statement of behavior), MPS allows work from multiple processes to behave as if it were all submitted from a single process. This provides a few benefits, including:

    • context switching overhead is reduced
    • kernels from different processes can run concurrently

    So MPS will be valuable if your work involves so much context switching, and the submitted work per process is in such small chunks, that context switch overhead starts to become a noticeable percentage of the timeline.

MPS will also be valuable (at the risk of some repetition) if the submitted work (kernels) is so small in scope that it doesn't fully occupy the GPU. The ability to overlap such kernels may improve overall utilization.
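
    As a minimal sketch of such work (not part of the original answer), the kernel below launches a single block of 64 threads and so occupies at most one SM. Launched from several separate processes without MPS, kernels like this are serialized by context switching; under MPS they may execute concurrently, which can raise overall utilization. The name tiny_kernel and the sizes are made up for illustration.

    // Hypothetical sketch: a kernel far too small to fill the GPU on its own.
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void tiny_kernel(float *out) {
        int i = threadIdx.x;
        out[i] = 2.0f * i;                 // trivial per-thread work
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("SMs on this device: %d (this launch uses at most 1 of them)\n",
               prop.multiProcessorCount);

        float *d_out;
        cudaMalloc(&d_out, 64 * sizeof(float));
        tiny_kernel<<<1, 64>>>(d_out);     // one block of 64 threads
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }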

Without making any measurements, we could assume that intelligent GPU runtime designers would design the time-slice interval to be much, much larger than the context-switch overhead/cost.

I'm not suggesting that the GPU requires MPS in order to do time-slicing, or (really) any relationship between MPS and the underlying context-switch mechanism. Time-slicing can be active with or without MPS. However, MPS means that time-slicing and context switching may be largely "unnecessary" when running work in an MPS setting. If the above benefits of MPS are not meaningful or relevant for your particular workload, then indeed:

Using MPS does not seem to make any difference.