I have a simulation written with Python/Numba that uses several CUDA GPUs. Each GPU is driven from a different process using a separate CUDA context. The simulation runs a very long loop and, at the end, reports its result to the parent process, which stores the best result so far; the process then keeps going.
When a GPU/process finishes its kernel and reports a new best result, I would like to kill the kernel executions on the other processes/GPUs so they can pick up this new best result and iterate on it, instead of waiting for them to finish. Each execution can take 30 minutes, so if I can kill one that has just started and restart it with better data, that saves me a lot of time.
I can't seem to find a way to stop a launched CUDA kernel.
Can this be done?
I'm using Numba 0.51.
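It's not possible to stop a running kernel in CUDA without one of the following:
1. Assistance from the kernel code itself.
2. Resetting the device (cudaDeviceReset()) or, equivalently for this discussion, terminating the host process that owns the relevant CUDA context.
Item 2 is not satisfactory here, so stopping a running kernel "asynchronously" requires the kernel code (all threads) to poll a location that indicates when to stop.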
A typical way to provide such a memory location is to use pinned/zero-copy techniques in CUDA. In Numba, this type of memory is allocated as mapped memory (cuda.mapped_array). Such memory is accessible from both host and device at the same time. An additional wrinkle is that the device code must not cache any copies of the memory locations used for communication. The only method I found in Numba to accomplish this is to use atomics.
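To illustrate just the handshake (stop slot, work counter, acknowledgment slot) without a GPU, here is a minimal CPU-only sketch. It is an analogy, not CUDA code: a Python thread stands in for the kernel, a plain list stands in for the mapped array, and the names STOP, WORK, ACK, worker and run_handshake are made up for this illustration.

```python
import threading
import time

# Hypothetical slot layout: stop signal, work counter, acknowledgment.
STOP, WORK, ACK = 0, 1, 2


def worker(flags):
    """Stand-in for the kernel: work until the stop slot is set."""
    while flags[STOP] < 1:      # poll for the stop signal
        flags[WORK] += 1        # do "other work"
        time.sleep(0)           # yield so the host thread can run
    flags[ACK] = 1              # acknowledge the stop signal


def run_handshake():
    flags = [0, 0, 0]           # stand-in for the shared host/device array
    t = threading.Thread(target=worker, args=(flags,))
    t.start()
    while flags[WORK] == 0:     # wait until the "kernel" has made progress
        time.sleep(0.001)
    flags[STOP] = 1             # host raises the stop signal
    t.join(timeout=5.0)         # plays the role of a synchronize call
    return flags, t.is_alive()


if __name__ == "__main__":
    flags, still_running = run_handshake()
    print(flags[STOP], flags[ACK], still_running)
```

On a real device the polling read also has to bypass any cached copy, which is the role the atomics play; the list in this sketch has no such concern.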
Here is a worked example combining these ideas:
$ cat t51.py
import numpy as np
import numba as nb
from numba import cuda

@cuda.jit
def test(arr):
    while nb.cuda.atomic.max(arr, 0, 0) < 1:  # poll for signal to stop
        nb.cuda.atomic.add(arr, 1, 1)         # do "other work"
    arr[2] = 1                                # acknowledge stop signal

if __name__ == '__main__':
    arr = nb.cuda.mapped_array(3, dtype=np.int32)
    arr[0] = 0  # stop signal goes here
    arr[1] = 1  # monitoring "other work"
    arr[2] = 0  # acknowledgment of stop signal
    my_str = nb.cuda.stream()
    griddim = (1,1)
    blockdim = (1,1,1)
    test[griddim, blockdim, my_str](arr)  # launch work to be done
    for i in range(1000):  # for demo, give kernel time to start
        if arr[1] < 2:
            print(arr[1])
    print(arr[0])
    while arr[2] != 1:     # send stop signal, and wait for acknowledgment
        arr[0] = 1
    print(arr[0])          # for demo
    nb.cuda.synchronize()  # if stop is working correctly code will not hang here
    print(arr[0])          # for demo
    print(arr[1])
$ python t51.py
0
1
1
1600
$
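For the multi-process setup in the question, the same stop signal has to reach every worker. The following is a rough sketch of that fan-out, under assumptions: threads stand in for the per-GPU processes, and the names worker and broadcast_stop are hypothetical. In a real setup each process could presumably watch a multiprocessing.Event in the same way and write 1 into the stop slot of its own mapped array when the event fires.

```python
import threading
import time


def worker(worker_id, stop_event, results):
    """Stand-in for one GPU process running a long kernel."""
    steps = 0
    while not stop_event.is_set():   # host-side check for the parent's signal
        steps += 1                   # placeholder for monitoring the kernel
        time.sleep(0.001)
    # In the real program, this is where arr[0] = 1 would be written to the
    # mapped array, followed by waiting for the kernel's acknowledgment.
    results[worker_id] = steps


def broadcast_stop(num_workers=3):
    stop_event = threading.Event()
    results = {}
    threads = [
        threading.Thread(target=worker, args=(i, stop_event, results))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    time.sleep(0.05)                 # pretend one worker just reported a new best
    stop_event.set()                 # parent tells every worker to stop
    for t in threads:
        t.join(timeout=5.0)
    return results


if __name__ == "__main__":
    print(sorted(broadcast_stop().keys()))
```

After the workers stop, the parent can hand them the new best result and relaunch; the event would need to be cleared before the next round.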