pytorch · cuda · gpu

Is there a kernel queue inside a CUDA-enabled GPU?


When multiple PyTorch processes are running inference on the same Nvidia GPU, what happens when two kernel launch requests (cuLaunchKernel) from different contexts are handled by CUDA? Does the GPU have a FIFO queue for those kernel requests?

I have no idea how to observe the state of CUDA while running my PyTorch program. Any advice on how to profile an Nvidia GPU running multiple concurrent jobs would be helpful!


Solution

  • Kernels from different contexts never run at the same time; they run in a time-sharing fashion (unless MPS is used).

    Within the same CUDA context, kernels launched on the same CUDA stream never run at the same time. Instead, they are serialized by launch order, and the GPU executes them one at a time, so a CUDA stream acts like a FIFO queue within its context. Kernels launched on different CUDA streams (in the same context) have the potential to run concurrently.

    PyTorch uses a single CUDA stream by default. You can use its APIs to work with multiple streams, as in the sketch below: https://pytorch.org/docs/stable/notes/cuda.html#cuda-streams
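
    A minimal sketch (assuming a CUDA-capable machine; the tensor sizes and stream variables are illustrative, not from the original answer) of queuing independent work on two streams so the kernels have the potential to overlap:

    ```python
    import torch

    assert torch.cuda.is_available()

    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()

    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    torch.cuda.synchronize()  # ensure setup kernels finish first

    # Work queued on the same stream is serialized in launch order;
    # work on different streams may run concurrently if resources allow.
    with torch.cuda.stream(s1):
        c = a @ a  # enqueued on s1

    with torch.cuda.stream(s2):
        d = b @ b  # enqueued on s2; may overlap with the matmul on s1

    torch.cuda.synchronize()  # wait for both streams to complete
    ```

    Whether the two matmuls actually overlap depends on how fully each one occupies the GPU; a profiler such as Nsight Systems (nsys) can show the per-stream timeline, which also helps with the profiling question above.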