Nested loop summation in 2D using OpenCL

I have recently started working with OpenCL in C++ and I'm trying to fully understand how to use 2D and 3D NDRange. I'm currently implementing Inverse Distance Weighting in OpenCL, but my problem is general.

Below is the serial function to compute the weights and it consists of a nested loop.

void computeWeights(int nGrids, int nPoints, double *distances, double *weightSum, const double p) {

    for (int i = 0; i < nGrids; ++i) {
        double sum = 0;
        for (int j = 0; j < nPoints; ++j) {
            double weight = 1 / pow(distances[i * nPoints + j], p);
            distances[i * nPoints + j] = weight;
            sum += weight;
        }
        weightSum[i] = sum;
    }
}

What I would want is to implement the above function using a 2D NDRange, the first being over nGrids and the second over nPoints. What I don't understand, though, is how to handle the summation of the weights into weightSum[i]. I understand that I may have to use parallel sum reduction, somehow.

Solution

When dispatching a kernel with a 2D global workspace, OpenCL creates a grid of work-items. Each work-item executes the kernel and gets unique ids in both those dimensions.

(x,y)|________________________
     | (0,0) (0,1) (0,2) ...
     | (1,0) (1,1) (1,2)
     | (2,0) (2,1) (2,2)
     | ...

The work-items are also divided into groups and get unique ids within those work-groups. E.g. for work-groups of size (2,2):

(x,y)|________________________
     | (0,0) (0,1) (0,0) ...
     | (1,0) (1,1) (1,0)
     | (0,0) (0,1) (0,0)
     | ...

You can arrange the work-groups, so that each one of them performs a reduction.

Your SDK probably has samples, and a parallel reduction will be one of them.

To get you started, here is a kernel that solves your problem. It's in its simplest form, and works for a single work-group per row.

// cl::NDRange global(nPoints, nGrids);
// cl::NDRange local(nPoints, 1);
// cl::Local data(nPoints * sizeof (double));
kernel
void computeWeights(global double *distances, global double *weightSum, local double *data, double p)
{
    uint nPoints = get_global_size(0);

    uint j = get_global_id(0);
    uint i = get_global_id(1);
    uint lX = get_local_id(0);

    double weight = 1.0 / pow(distances[i * nPoints + j], p);

    distances[i * nPoints + j] = weight;

    data[lX] = weight;

    for (uint d = get_local_size(0) >> 1; d > 0; d >>= 1)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (lX < d)
            data[lX] += data[lX + d];
    }

    if (lX == 0) 
        weightSum[i] = data[0];
}

Each row of work-items (i.e. each work-group) computes the weights (and their sum) for grid i. Each work-item computes a weight, stores it back to distances, and loads it onto local memory. Then each work-group performs a reduction in local memory, and finally the result gets stored in weightSum.