c++image-processing boost parallel-processing opencl

Wrong pixel values when using padded local buffer OpenCL

I'm facing an unexpected result when I use a local buffer to copy data in an OpenCL kernel. The code presented here is quite simple (and useless since I don't need to use a local buffer for such an operation), but this is a first step for convolution-like processes.

Here is my code :

std::string implementCopyFromLocalKernel()
{
    return BOOST_COMPUTE_STRINGIZE_SOURCE(
    __kernel void copyFromLocal_knl(__global const float* in,
                                    const ulong sizeX, const ulong sizeY,
                                    const int filterRadiusX, const int filterRadiusY,
                                    __local float* localImage,
                                    const ulong localSizeX, const ulong localSizeY,
                                    __global float* out)
    {
        // Store each work-item’s unique row and column
        const int x = get_global_id(0);
        const int y = get_global_id(1);

        // Group size
        int groupSizeX = get_local_size(0);
        int groupSizeY = get_local_size(1);

        // Determine the size of the work group output region
        int groupIdX = get_group_id(0);
        int groupIdY = get_group_id(1);

        // Determine the local ID of each work item
        int localX = get_local_id(0);
        int localY = get_local_id(1);

        // Padding
        int paddingX = filterRadiusX;
        int paddingY = filterRadiusY;

        // Cache the data to local memory
        // Copy the data for the current coordinates
        localImage[localX + localY*localSizeX] = in[x + y * sizeX];

        barrier(CLK_LOCAL_MEM_FENCE);

        out[x + y * sizeX] = localImage[localX + localY*localSizeX];

        return;
    }
    );
}

void copyLocalBuffer(const boost::compute::context& context,  boost::compute::command_queue& queue, const boost::compute::buffer& bufInn boost::compute::buffer& bufOut, const size_t sizeX, const size_t sizeY)
{
    const size_t nbPx = sizeX * sizeY;
    const size_t maxSize = (sizeX > sizeY ? sizeX : sizeY);

    // Prepare to launch the kernel
    std::string kernel_src = implementCopyFromLocalKernel();
    boost::compute::program program;
    try {
        program = boost::compute::program::create_with_source(kernel_src, pGpuDescription->getContext(deviceIdx));
        program.build();
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error bulding program from source : " << std::endl << e.what() << std::endl
            << program.build_log() << std::endl;
        return;
    }

    boost::compute::kernel kernel;
    try {
        kernel = program.create_kernel("copyFromLocal_knl");
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error creating kernel : " << std::endl << e.what() << std::endl;
        return;
    }

    try {
        int localSizeX = 16;
        int localSizeY = 16;
        int paddingPixelsX = 2;// 0; // <- Changing to 0 works
        int paddingPixelsY = paddingPixelsX;

        int localWidth = localSizeX + 2 * paddingPixelsX;
        int localHeight = localSizeY + 2 * paddingPixelsY;

        boost::compute::buffer localImage(context, localWidth*localHeight * sizeof(float));

        kernel.set_arg(0, bufIn);
        kernel.set_arg(1, sizeX);
        kernel.set_arg(2, sizeY);
        kernel.set_arg(3, paddingPixelsX);
        kernel.set_arg(4, paddingPixelsY);
        kernel.set_arg(5, localImage);
        kernel.set_arg(6, localWidth);
        kernel.set_arg(7, localHeight);
        kernel.set_arg(8, bufOut);
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error setting kernel arguments: " << std::endl << e.what() << std::endl;
        return;
    }

    try {

        size_t origin[2] = { 0, 0 };
        size_t region[2] = { 256, 256 };// { sizeX, sizeY };
        size_t localSize[2] = { 16, 16 };
        queue.enqueue_nd_range_kernel(kernel, 2, origin, region, localSize);
    }
    catch (const boost::compute::opencl_error& e) {
        std::cout << "Error executing kernel : " << std::endl << e.what() << std::endl;
        return;
    }
}

I reduced the code to simply copy the pixels corresponding to each work item in the associated local coordinate of the local image. Hence, the local image buffer must have unused data for 2*paddingPixelsX on each line and 2*paddingPixelsY unused lines.

It works if I don't add padding data (paddingPixelsX and paddingPixelsY = 0), but it seems that some work items don't read the data from the input buffer or write the data into the ouput buffer (or the local buffer?) in the correct place. Moreover, when I run my program several times, I never get the same result.

This is an example of result I get (right) for the mandrill image as input (left) :

I ensure that the threads are synchronized with barrier(CLK_LOCAL_MEM_FENCE); and each work item read and write a specific data and if my code is buggy, I don't understand why no padding don't gives errors.

Does someone has an idea?

Thanks,

Solution

Thanks to @doqtor, I understood that the issue came from the buffer passed as kernel parameter. Because of that, all work group used the same buffer.

Since I don't know the padding size I will need for convolution operations, I need this buffer as parameter. I modified the kernel parametrization so that a different buffer is used by each work group :

kernel.set_arg(5, localWidth*localHeight*sizeof(float), NULL);

I missed the important part when I read the documentation of clSetKernelArg:

If the argument is declared with the __local qualifier, the arg_value entry must be NULL.