Tags: opencl, pyopencl

How to cleanly exit OpenCL code


I'm looking for a simple and clean way to tell my host that some code in OpenCL led to an error and that it should abandon further work. I (think I) know that try, catch and assert don't work in OpenCL C. Furthermore, kernels have to be defined as functions returning void, so simply returning an error code is off the table as well. The only idea I had is passing a cl_mem object between host and kernel and checking its value between kernel enqueues or launches, which somehow enforces a very strong kind of serialization. Is there a better idea, maybe using events?
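
For reference, a minimal host-side sketch of what I mean (context, queue, kernel, globalSize and numPasses are placeholder names assumed to be set up elsewhere; error checking of the cl* calls is omitted for brevity):

    cl_int hostError = 0;
    cl_mem errBuf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   sizeof(cl_int), &hostError, NULL);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &errBuf); // error flag as kernel argument 1 (just for this sketch)

    for(int pass = 0; pass < numPasses; ++pass)
    {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

        // blocking read of the flag: this is exactly the serialization I'd like to avoid
        clEnqueueReadBuffer(queue, errBuf, CL_TRUE, 0, sizeof(cl_int),
                            &hostError, 0, NULL, NULL);
        if(hostError != 0)
            break; // abandon further work
    }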


Solution

  • If you need something like

    for(i = 0 to N)
    {
        do work(i);
        if(error) break;
    }
    

    but in a parallel fashion, you can do something like this inside the kernel:

    // error is a one-element __global int buffer passed as a kernel argument
    int threadId = get_global_id(0);
    
    // a broadcast read (on a recent GPU), so almost no performance hit
    read_mem_fence(CLK_GLOBAL_MEM_FENCE);
    if(error[0] == 0) // or use atomic_add(&error[0], 0) to read atomically (when the total number of threads is low, e.g. thousands)
    {
         do work (threadId);
         // on error:
         atomic_add(&error[0], errCode);
         write_mem_fence(CLK_GLOBAL_MEM_FENCE);
    }
    

    so at least you save cycles at the thread-group level, which lets threads that start after an atomic error write finish quickly. Atomic operations are slow, but error handling should make that less important, right? Also, depending on the device type and drivers, an atomic write may need to be followed by at least thousands of threads before a plain non-atomic read sees it, so for a million threads this can be efficient; for only a thousand threads you should use an atomic read (an atomic add with a zero value) instead. Each thread then spends an extra cycle or two before the actual work begins, but at least that latency may be hidden by heavy compute.
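
    For the low-thread-count case, the atomic-read variant mentioned above could look like this (a sketch; error is the same one-element flag buffer as before):

    if(atomic_add(&error[0], 0) == 0) // atomic read: returns the old value, adds nothing
    {
         do work (threadId);
    }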

    If you have multiple devices that need to inform each other about errors, you should create the error buffer with CL_MEM_USE_HOST_PTR so that error codes are read/written in host memory directly instead of device memory. This can be less performant than device memory, since the error buffer will not be cached and sits farther away from the device: think of a 5 GB/s PCI-e bandwidth bottleneck instead of the 5 TB/s of device memory (assuming single-cycle broadcasts to all cores on the latest graphics cards).
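
    A minimal host-side sketch of that multi-device setup (sharedContext, kernelOnDeviceA and kernelOnDeviceB are placeholder names assumed to be created elsewhere; posix_memalign is used only to get an aligned allocation; cl* error checks are omitted):

    // page-aligned host allocation; alignment/size requirements vary by vendor,
    // 4096 bytes is a commonly safe choice
    cl_int *hostErr = NULL;
    posix_memalign((void**)&hostErr, 4096, 4096);
    hostErr[0] = 0;

    // the buffer lives in host memory, so every device in the context works on the same flag
    cl_mem errBuf = clCreateBuffer(sharedContext,
                                   CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                   4096, hostErr, NULL);

    // pass the same buffer to the kernels running on each device
    clSetKernelArg(kernelOnDeviceA, 1, sizeof(cl_mem), &errBuf);
    clSetKernelArg(kernelOnDeviceB, 1, sizeof(cl_mem), &errBuf);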