OpenCL kernel communication using volatile memory

I'm trying to get two OpenCL kernels to communicate with eath other. A worker kernel runs a loop and a control kernel feeds it jobs and tells it when it's done. I'm using a volatile device buffer for communication. It works when I'm using the Intel OpenCL 2.1 platform, but when I'm using the Nvidia OpenCL 3.0 CUDA (Quadro P400) platform the program hangs. It appears that the worker kernel loops forever.

MWE below. The PLAT_IDX define can be 0 or 1 and selects between the Intel and Nvidia platforms:

#include <assert.h>
#include <stdio.h>
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#define PLAT_IDX 0
#define CHK(x)      assert((x) == CL_SUCCESS)

const char *PROGRAM = (
    "__kernel void loop(volatile __global uint *buf) {"
    "   while (buf[0] != 123) { buf[1]++; }"
    "}"
    "__kernel void post(volatile __global uint *buf) {"
    "   buf[0] = 123;"
    "}"
);

int main(int argc, char *argv[]) {
    cl_uint n_platforms, n_devices;
    CHK(clGetPlatformIDs(0, NULL, &n_platforms));
    cl_platform_id plats[2];
    CHK(clGetPlatformIDs(n_platforms, plats, NULL));
    CHK(clGetDeviceIDs(plats[PLAT_IDX], CL_DEVICE_TYPE_ALL, 0, NULL, &n_devices));
    cl_device_id dev;
    CHK(clGetDeviceIDs(plats[PLAT_IDX], CL_DEVICE_TYPE_ALL, 1, &dev, NULL));
    assert(n_platforms > 0 && n_devices > 0);

    cl_int err;
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    CHK(err);
    cl_command_queue_properties props[] = {
        CL_QUEUE_PROPERTIES,
        CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
    };
    cl_command_queue queue = clCreateCommandQueueWithProperties(
        ctx, dev, props, &err);
    CHK(err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &PROGRAM, NULL, &err);
    CHK(err);
    CHK(clBuildProgram(prog, 1, &dev, NULL, NULL, NULL));
    cl_kernel loop = clCreateKernel(prog, "loop", &err);
    CHK(err);
    cl_kernel post = clCreateKernel(prog, "post", &err);
    CHK(err);
    cl_mem mem = clCreateBuffer(
        ctx, CL_MEM_READ_WRITE, 2 * sizeof(cl_uint), NULL, &err);
    CHK(err);
    cl_event evs[2];
    CHK(clSetKernelArg(loop, 0, sizeof(cl_mem), &mem));
    CHK(clSetKernelArg(post, 0, sizeof(cl_mem), &mem));
    CHK(clEnqueueNDRangeKernel(
            queue, loop, 1, NULL, (size_t[]){1},
            NULL, 0, NULL, &evs[0]));
    CHK(clEnqueueNDRangeKernel(
            queue, post, 1, NULL, (size_t[]){1},
            NULL, 0, NULL, &evs[1]));
    printf("Waiting for kernels\n");
    CHK(clWaitForEvents(2, evs));
    return 1;
}

Maybe there is a better way to accomplish this? OpenCL 2.0 has pipes, but very few devices supports them.

Solution

You cannot have two OpenCL kernels communicate. The volatile keyword doesn't enable that either. Kernels are placed in a queue and executed one after the other. Only in separate queues, kernels might get executed at the same time, but there is no guarantee.

Try to solve it without requiring kernel communication. Have one kernel finish and store results in global memory, and then the other kernel work with the results of the first kernel.

As a side note, you can do communication across threads within one kernel. As long as the threads are in the same workgroup, you can use local memory. If you need to communicate across workgroups, you can still do that with atomics.