Running the same for loop on many GPU threads, using OpenCL

I need to subtract a 2D array, D, from many other 2D arrays. I have linearized (flattened) all the arrays: D is a 25-element array, and imges is a 1D array where four 25-element arrays have been joined together. In other words, to subtract D from four 5x5 arrays, I turn each 5x5 array into one 25-element array and then concatenate the four of them. That concatenation is imges, which in this example is a 100-element array. I believe I am capturing this properly in my kernel, index-wise.

The only way that has come to mind to do the subtraction is to run a for loop, so that every element from D will get subtracted from the array in the corresponding thread. My idea was that this would work as follows:

Each thread would receive the D array to be subtracted, and one of the arrays from which D has to be subtracted (in my example, 1/4 of imges).
I would iterate through the elements of both arrays with a for loop to do the subtraction element by element.

However, it is not working as expected: it seems like just the last or first value of D gets chosen and then subtracted from all the elements of the other arrays.

I thought I had the hang of how indexing and threading work on the GPU, but now I am not so sure, since this has been challenging me for a while. The kernel is below.

Is there a better way to do this other than with a for loop? Thanks a lot in advance.

__kernel void reduce(__global float* D, __global float* imges, __global float* res)
{
    const int x = (int)get_global_id(0);
    const int y = (int)get_global_id(1);
    const int z = (int)get_global_id(2);

    int im_i = imges[x+25]; //Images are 5x5 meaning a 25-size array

   for(int j = 0; j < 25; j++){
       res[x+25] = im_i - D[j];
   }
}

Edit: I do not wish to parallelize the for loop itself, since the arrays will probably get bigger and I don't want to run into trouble with overhead.

Solution

If I understand what you are trying to do correctly, your kernel should look more like this:

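__kernel void reduce(__global float* D, __global float* imges, __global float* res)
{
  // One work-item per image: x selects which 25-element block to process.
  const int x = (int)get_global_id(0);

  for(int j = 0; j < 25; j++){
    // Element j of image x lives at offset x*25 + j in the flattened array.
    res[x*25 + j] = imges[x*25 + j] - D[j];
  }
}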

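This kernel subtracts the jth element of D from the jth element of each work-item's 25-element array in imges. It assumes a one-dimensional launch with one work-item per image, so the global size would be 4 in your example.

Your original kernel went wrong in three places. First, imges[x+25] reads a single element at offset x+25 rather than the start of image x, which is at x*25. Second, storing that value in an int truncates the float. Third, every iteration of the loop writes to the same location, res[x+25], so only the result of the last iteration (the one using D[24]) survives. That is why it looked as if only the last value of D was being subtracted.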
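To check the kernel's output, it can help to compare it against a serial version of the same indexing on the host. The following is a minimal sketch in plain C, not part of the original code: the function name subtract_reference, the IMG_SIZE constant, and the n_images parameter are all made up for illustration. Each pass of the outer loop stands in for one work-item.

```c
#include <assert.h>
#include <stddef.h>

#define IMG_SIZE 25  /* 5x5 image flattened to 25 floats, as in the question */

/* Serial stand-in for the kernel (hypothetical helper): each pass of the
   outer loop plays the role of one work-item, with x standing in for the
   value returned by get_global_id(0). */
void subtract_reference(const float *D, const float *imges, float *res,
                        size_t n_images)
{
    assert(D != NULL && imges != NULL && res != NULL);
    for (size_t x = 0; x < n_images; x++) {
        for (size_t j = 0; j < IMG_SIZE; j++) {
            /* Same offset as the kernel: element j of image x. */
            res[x * IMG_SIZE + j] = imges[x * IMG_SIZE + j] - D[j];
        }
    }
}
```

Calling subtract_reference(D, imges, res, 4) on the same input buffers and comparing element by element with the buffer read back from the device should show quickly whether an indexing mistake remains.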