I was wondering whether it is possible, or even safe, to run multiple CuPy functions or Numba CUDA kernels in parallel inside the same code. Currently my code does this:
import math
import cupy as cp

for i in range(int(nLoop)):
    # shuffle the array
    cp.random.shuffle(temp)
    temp1, temp2 = cp.split(temp, 2)
    # configure the number of blocks for the CUDA kernels
    if len(temp1) >= len(temp2):
        blocks = int(math.ceil(len(temp1) / tpb))
    else:
        blocks = int(math.ceil(len(temp2) / tpb))
    Ti = myCalculations(temp1, temp2)  # function that executes some Numba CUDA kernels
    results[i] = Ti
What I would like to do is split this for loop into several loops, e.g. 10 loops running in parallel, perhaps using prange. Is this possible (taking into account, of course, that I would need temporary lists to store the results, etc.)?
In terms of GPU memory, a typical run uses around 600 MB out of 16 GB, so I don't see any issues there.
"Is it possible or even safe to run multiple CuPy functions or Numba CUDA kernels in parallel inside the same code?"
Doing that is rather a bad idea. There are two main ways to run CPython code in parallel.

The first, multithreading, should be safe because of the GIL (global interpreter lock), but it is also very inefficient for the same reason: the GIL prevents CuPy calls from running in parallel unless they release it (which is certainly not the case yet).

The second, multiprocessing, suffers from a slow initialization procedure in each new process (CUDA runtime initialization, Numba functions possibly being recompiled or fetched from the cache, etc.), and you will need to share GPU data between multiple processes, which is a bit tricky since it requires the CUDA runtime IPC functions exposed by CuPy (see ipcOpenMemHandle, for example, and the sketch below).

In both cases, the low-level CUDA function calls will almost certainly be executed serially (that was the case the last time I tried with a multithreaded application). Moreover, running multiple CUDA kernels concurrently generally does not make them (much) faster, because the GPU already executes each kernel in parallel internally. In fact, in a multithreaded application they will always run serially unless you use multiple streams, and even with multiple streams this is generally not significantly faster in most cases. See this post for more information. Also note that you cannot call CuPy from inside Numba code (especially parallel Numba code).
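For completeness, here is a minimal, hedged sketch of that IPC route. The child_work function, the +1 computation, and the buffer size are all made up for this example; memory-pool subtleties and error handling are ignored:

import multiprocessing as mp
import numpy as np
import cupy as cp

def child_work(handle, shape, dtype_str):
    # Map the parent's GPU allocation into this process.
    ptr = cp.cuda.runtime.ipcOpenMemHandle(handle)
    nbytes = int(np.prod(shape)) * cp.dtype(dtype_str).itemsize
    mem = cp.cuda.UnownedMemory(ptr, nbytes, None)
    arr = cp.ndarray(shape, cp.dtype(dtype_str), cp.cuda.MemoryPointer(mem, 0))
    arr += 1  # placeholder work on the shared buffer
    cp.cuda.Device().synchronize()
    cp.cuda.runtime.ipcCloseMemHandle(ptr)

if __name__ == "__main__":
    data = cp.zeros(1024, dtype=cp.float32)
    handle = cp.cuda.runtime.ipcGetMemHandle(data.data.ptr)
    # "spawn" avoids forking an already-initialized CUDA context.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=child_work, args=(handle, data.shape, "float32"))
    p.start()
    p.join()
    print(data[:4])  # the child's increment is visible in the parent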
The general solution to this problem is to write your own kernel that works directly on the whole dataset. This is generally far more efficient than launching many small kernels, but it can also be far more complex to write. This is unfortunately how GPUs work.
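To illustrate the idea (the element-wise subtraction is only a placeholder for whatever myCalculations actually computes, and building all the shuffles up front costs nLoop times more memory), a single batched Numba kernel could replace the whole loop:

import math
import cupy as cp
from numba import cuda

@cuda.jit
def batched_calc(a, b, out):
    # a, b, out have shape (n, half); axis 0 replaces the former Python loop.
    j, i = cuda.grid(2)  # j: element index (x), i: iteration index (y)
    if i < out.shape[0] and j < out.shape[1]:
        out[i, j] = a[i, j] - b[i, j]  # placeholder for the real computation

n = int(nLoop)
half = len(temp) // 2
a = cp.empty((n, half), dtype=temp.dtype)
b = cp.empty((n, half), dtype=temp.dtype)
for i in range(n):  # the shuffles stay sequential; the heavy work is batched
    cp.random.shuffle(temp)
    a[i], b[i] = cp.split(temp, 2)

out = cp.empty_like(a)
tpb2d = (32, 8)
grid2d = (math.ceil(half / tpb2d[0]), math.ceil(n / tpb2d[1]))
batched_calc[grid2d, tpb2d](a, b, out)  # one launch instead of n launches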
Note that if your loop iterations are completely independent, you can use multiple streams from a single sequential host process, so that the CUDA runtime can make better use of the GPU by reducing the number of (unwanted) slow synchronizations.
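A hedged sketch of that idea with CuPy streams (the choice of 4 streams is arbitrary, and it assumes the iterations do not write to shared data; the Numba kernels inside myCalculations would additionally need to be launched on the same stream, e.g. via numba.cuda.external_stream, to actually benefit):

import cupy as cp

n_streams = 4  # arbitrary; a handful is usually enough
streams = [cp.cuda.Stream(non_blocking=True) for _ in range(n_streams)]

results = cp.empty((int(nLoop), len(temp) // 2), dtype=temp.dtype)
for i in range(int(nLoop)):
    with streams[i % n_streams]:   # queue this iteration's work on one stream
        buf = temp.copy()          # private copy so temp itself is only read
        cp.random.shuffle(buf)
        temp1, temp2 = cp.split(buf, 2)
        results[i] = myCalculations(temp1, temp2)

for s in streams:
    s.synchronize()  # wait for all queued work before reading results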