I need to subtract a 2D array, D, from many other different 2D arrays. I have linearized (flattened) all the arrays: D is a 25-element array, and imges is a 1D array where four 25-element arrays have been joined together. Meaning: if I want to subtract D from four 5x5 arrays, I turn each of those 5x5 arrays into one 25-element array and then append the four arrays. That is what imges is; in this example it would be a 100-element array. I believe I am capturing this properly in my kernel, index-wise.
The only way that has come to mind to do the subtraction is to run a for loop, so that every element from D will get subtracted from the array in the corresponding thread. My idea was that this would work as follows:
- Each thread would receive the D array to be subtracted, and one of the arrays from which D has to be subtracted (in my example, 1/4 of imges).
- I would iterate through the elements of both arrays with a for loop to do the subtraction element by element.
However, it is not working as expected: it seems like just the last or first value of D gets chosen and then subtracted from all the elements of the other arrays.
I thought I had the hang of how indexing and threading work on the GPU, but now I am not so sure, since this has been challenging me for a while. The kernel is below.
Is there a better way to do this other than with a for loop? Thanks a lot in advance.
    __kernel void reduce(__global float* D, __global float* imges, __global float* res)
    {
        const int x = (int)get_global_id(0);
        const int y = (int)get_global_id(1);
        const int z = (int)get_global_id(2);
        int im_i = imges[x+25]; //Images are 5x5 meaning a 25-size array
        for(int j = 0; j < 25; j++){
            res[x+25] = im_i - D[j];
        }
    }
Edit: I do not wish to parallelize the for loop itself, since the arrays will probably get bigger and I don't want to run into trouble with overhead.
If I understand what you are trying to do correctly, your kernel should look more like this:
    __kernel void reduce(__global float* D, __global float* imges, __global float* res)
    {
        const int x = (int)get_global_id(0);
        for(int j = 0; j < 25; j++){
            res[x*25 + j] = imges[x*25 + j] - D[j];
        }
    }
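This kernel subtracts the j-th element of D from the j-th element of each work-item's 25-element array in imges. Each work-item x handles one whole image, which starts at offset x*25 in imges. In the original kernel, every loop iteration wrote to the same element, res[x+25], so only the last subtraction (with D[24]) survived; storing imges[x+25] in an int also truncated the float value.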
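To check the kernel's results on the host, a plain C reference can be handy. The sketch below is not OpenCL: it emulates one work-item per image with an ordinary loop over x. The names subtract_one_image, subtract_template, IMG_SIZE, and num_images are made up for illustration and do not appear in the original post.

```c
#include <assert.h>
#include <stddef.h>

#define IMG_SIZE 25 /* one 5x5 image flattened to 25 elements, as in the question */

/* Hypothetical helper: does the work of a single work-item with global id x.
   Image x occupies elements [x*IMG_SIZE, x*IMG_SIZE + IMG_SIZE) of imges. */
void subtract_one_image(const float *D, const float *imges, float *res, size_t x)
{
    for (size_t j = 0; j < IMG_SIZE; j++) {
        res[x * IMG_SIZE + j] = imges[x * IMG_SIZE + j] - D[j];
    }
}

/* Hypothetical host-side reference: loops over the images sequentially,
   standing in for a 1D launch with one work-item per image. */
void subtract_template(const float *D, const float *imges, float *res, size_t num_images)
{
    assert(D != NULL && imges != NULL && res != NULL);
    for (size_t x = 0; x < num_images; x++) {
        subtract_one_image(D, imges, res, x);
    }
}
```

For the example in the question, num_images would be 4, which corresponds to launching the kernel with a one-dimensional global size of 4 so that get_global_id(0) takes the values 0 to 3. Comparing the buffer read back from the device against the output of this reference is one way to confirm the indexing.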