Why is local memory in this OpenCL algorithm so slow?

I am writing some OpenCL code. My kernel should create a special "accumulator" output based on an input image. I have tried two concepts and both are equally slow, although the second one uses local memory. Could you please help me identify why the local memory version is so slow? The target GPU for the kernels is a AMD Radeon Pro 450.

// version one
__kernel void find_points(__global const unsigned char* input, __global unsigned int* output) {
  const unsigned int x = get_global_id(0);
  const unsigned int y = get_global_id(1);

  int ind;

  for(k = SOME_BEGINNING; k <= SOME_END; k++) {
    // some pretty wild calculation
    // ind is not linear and accesses different areas of the output
    ind = ...
    if(input[y * WIDTH + x] == 255) {
      atomic_inc(&output[ind]);
    }
  }

}

// variant two
__kernel void find_points(__global const unsigned char* input, __global unsigned int* output) {
  const unsigned int x = get_global_id(0);
  const unsigned int y = get_global_id(1);

  __local int buf[7072];
  if(y < 221 && x < 32) {
    buf[y * 32 + x] = 0;
  }
  barrier(CLK_LOCAL_MEM_FENCE);

  int ind;
  int k;

  for(k = SOME_BEGINNING; k <= SOME_END; k++) {
    // some pretty wild calculation
    // ind is not linear and access different areas of the output
    ind = ...
    if(input[y * WIDTH + x] == 255) {
      atomic_inc(&buf[ind]);
    }
  }

  barrier(CLK_LOCAL_MEM_FENCE);
  if(get_local_id(0) == get_local_size(0) - 1)
    for(k = 0; k < 7072; k++)
      output[k] = buf[k];
  }

}

I would expect that the second variant is faster than the first one, but it isn't. Sometimes it is even slower.

Solution

Local buffer size __local int buf[7072] (28288 bytes) is too big. I don't know how big shared memory for AMD Radeon Pro 450 is but likely that is 32kB or 64kB per computing unit.

32768/28288 = 1, 65536/28288 = 2 means only 1 or maximum 2 wavefronts (64 work items) can run simultaneously only, so occupancy of computing unit is very very low hence poor performance.

Your aim should be to reduce local buffer as much as possible so that more wavefronts can be processed simultaneously.

Use CodeXL to profile your kernel - there are tools to show you all of this. Alternatively you can have a look at CUDA occupancy calculator excel spreadsheet if you don't want to run the profiler to get a better idea of what that is about.