Tags: cuda, multiprocessing, python-multiprocessing, gpgpu, numba

How to stop/cancel a cuda kernel launched by Numba before it ends?


I have a simulation written with Python/Numba that uses several CUDA GPUs. Each one is driven from a different process using a separate CUDA context. The simulation runs a very long loop and, at the end, reports the result to the parent process, which stores the best result so far. The process then keeps going.

When a GPU / process finishes its kernel and reports a new best result, I would like to kill the kernel executions on the other processes / GPUs so they can pick up this new best result and iterate on it, instead of waiting for them to finish. Each execution can take 30 minutes, so if I can kill one that has just started and restart it with better data, that saves me a lot of time.

I can't seem to find a way to stop a CUDA kernel once it has been launched.

Can this be done?

I'm using Numba 0.51.


Solution

  • It's not possible to stop a running kernel in CUDA without:

    1. assistance from the kernel code itself (or)
    2. corrupting the CUDA context, making any subsequent CUDA operations fail

    Item 2 is not satisfactory. Therefore, stopping a running kernel "asynchronously" requires the kernel code (all threads) to poll a memory location that indicates when to stop.

    A typical way to provide such a memory location is to use pinned/zero-copy techniques in CUDA. In Numba, this type of memory is allocated as mapped memory. Such memory is accessible from both host and device at the same time. An additional wrinkle is that the device code must not cache any copies of the memory locations used for communication. The only method I found in Numba to accomplish this is to use atomics.

    Here is a worked example combining these ideas:

    $ cat t51.py
    import numpy as np
    import numba as nb
    
    from numba import cuda
    
    @cuda.jit
    def test(arr):
        while nb.cuda.atomic.max(arr, 0, 0) < 1: #poll for signal to stop
            nb.cuda.atomic.add(arr, 1, 1)        #do "other work"
        arr[2] = 1                               #acknowledge stop signal
    
    if __name__ == '__main__':
    
        arr = nb.cuda.mapped_array(3, dtype=np.int32)
        arr[0] = 0   # stop signal goes here
        arr[1] = 1   # monitoring "other work"
        arr[2] = 0   # acknowledgment of stop signal
        my_str = nb.cuda.stream()
        griddim = (1,1)
        blockdim = (1,1,1)
        test[griddim, blockdim, my_str](arr)   # launch work to be done
        for i in range(1000):  # for demo, give kernel time to start
            if arr[1] < 2:
                print(arr[1])
        print(arr[0])
        while arr[2] != 1:     # send stop signal, and wait for acknowledgment
            arr[0] = 1
        print(arr[0])          # for demo
        nb.cuda.synchronize()  # if stop is working correctly code will not hang here
        print(arr[0])          # for demo
        print(arr[1])
    $ python t51.py
    0
    1
    1
    1600
    $
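
In the multi-process setup from the question, the mapped array lives inside each worker process, so the parent cannot write the stop flag directly. One possible approach, sketched below under stated assumptions, is to have each worker run a small watcher thread that waits on an event set by the parent and then raises the flag. To keep the sketch runnable without a GPU, a plain Python list stands in for the mapped array and a CPU loop stands in for the polling kernel. The names relay_stop, polling_loop and run_worker are hypothetical and not part of Numba.

```python
import threading

def relay_stop(stop_event, flag):
    """Wait for the parent's signal, then raise the flag that the kernel polls."""
    stop_event.wait()
    flag[0] = 1   # in a real worker, flag would be the mapped array

def polling_loop(flag, max_iters=50_000_000):
    """CPU stand-in for the kernel: do work until flag[0] is raised."""
    work = 0
    while flag[0] < 1 and work < max_iters:
        work += 1     # "other work"
    flag[2] = 1       # acknowledge the stop signal
    return work

def run_worker(stop_event):
    """Hypothetical worker body: start the watcher, then run the 'kernel'."""
    flag = [0, 0, 0]  # assumption: stand-in for nb.cuda.mapped_array(3, dtype=np.int32)
    watcher = threading.Thread(target=relay_stop, args=(stop_event, flag), daemon=True)
    watcher.start()
    work = polling_loop(flag)
    watcher.join(timeout=1.0)
    return flag, work

if __name__ == '__main__':
    stop_event = threading.Event()   # a multiprocessing.Event has the same wait()/set() interface
    threading.Timer(0.05, stop_event.set).start()   # plays the role of the parent
    flag, work = run_worker(stop_event)
    print(flag[0], flag[2], work < 50_000_000)
```

In a real worker, the list would be replaced by the mapped array and polling_loop by the asynchronous kernel launch from the example above. Once the acknowledgment in flag[2] is seen, the worker could synchronize and relaunch with the new best result.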