This question is related to an existing question posted by me a couple of weeks ago: TERCOM algorithm - Changing from single thread to multiple threads in CUDA
Briefly explained, each of the threads in the kernel calculates a MAD value and I would like to know the minimum and its location.
I've tried to use atomicMin like this
__global__ void kernel (int m, int n, int h, int N, int *f, float heading, float *measurements, int *global_min)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
float MAD=0;
float pos[2];
float theta=heading*(PI/180);
float fval = 0;
// Calculate how much to move in x and y direction
float offset_x = h*cos(theta);
float offset_y = -h*sin(theta);
//Calculate Mean Absolute Difference
if(idx < n && idy < m)
{
for(float g=0; g<N; g++)
{
float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
MAD += abs(measurements[(int)g]-fval);
}
}
cuPrintf("%.2f \n",MAD);
atomicMin(global_min, MAD);
pos[0]=idx;
pos[1]=idy;
f[0]=*global_min;
f[1]=pos[0];
f[2]=pos[1];
}
And it produce the right result, but atomicMin is unable to find the location of the minimum.
I also tried to use the thrust library
__global__ void kernel (int m, int n, int h, int N, int *f, float heading, float *measurements, int *global_min, float *dev_MAD)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
float theta=heading*(PI/180);
float fval = 0;
// Calculate how much to move in x and y direction
float offset_x = h*cos(theta);
float offset_y = -h*sin(theta);
//Calculate Mean Absolute Difference
if(idx < n && idy < m)
{
for(float g=0; g<N; g++)
{
float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
*dev_MAD += abs(measurements[(int)g]-fval);
}
}
cuPrintf("%.2f \n",MAD);
}
And calling the kernel like this
kernel <<< dimGrid,dimBlock >>> (m, n, h, N, dev_results, heading, dev_measurements, global_min, dev_MAD);
thrust::device_ptr<float> dev_ptr(dev_MAD);
thrust::device_ptr<float> min_pos = thrust::min_element(dev_ptr, dev_ptr + n*m);
int abs_pos = min_pos - dev_ptr;
float min_val=min_pos[0];
cudaMemcpy(&min_val, dev_MAD+abs_pos, sizeof(float), cudaMemcpyDeviceToHost);
// Print out the result
printf("Min=%.2f pos=%d\n",min_val,abs_pos);
But this program print out: Min=-207521258711807190000000000000000000000.00 pos=0
I've looked at many reduction examples, but it seems like in everyone they have the values stored in an array, and not in each individual thread.
So to the questions:
For your Thrust code, you're writing to dev_MAD[0], yet computing results as though you've written to the entire array.
IIUC, you're trying to find the minimum value and corresponding location, you have the values as a variable in each thread but not stored in memory.
There are a couple of easy ways I can think to do this but both involve storing the values to memory and computing the minimum/location in a second pass.
Firstly, you could just use Thrust's min_element as you have already tried, but you would store the values to a device_vector in your kernel and then call thrust::min_element independently.
Secondly, you could save some memory space and bandwidth by computing the minimum/location within the threadblock first (and then use thrust::min_element afterwards). For this you could use CUB's reduction with a custom reduce operator (compare on value, datum is {value,index}).