Finding the minimum between threads

This question is related to an existing question posted by me a couple of weeks ago: TERCOM algorithm - Changing from single thread to multiple threads in CUDA

Briefly explained, each of the threads in the kernel calculates a MAD value and I would like to know the minimum and its location.

I've tried to use atomicMin like this

__global__ void kernel (int m, int n, int h, int N, int *f, float heading, float *measurements, int *global_min) 
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    float MAD=0;
    float pos[2];
    float theta=heading*(PI/180);
    float fval = 0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta); 

    //Calculate Mean Absolute Difference
    if(idx < n && idy < m)
    {
        for(float g=0; g<N; g++)
        {
            float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
            MAD += abs(measurements[(int)g]-fval); 
        }
    }
    cuPrintf("%.2f \n",MAD);

    atomicMin(global_min, MAD);
    pos[0]=idx;
    pos[1]=idy; 

    f[0]=*global_min;
    f[1]=pos[0];
    f[2]=pos[1];
}

And it produce the right result, but atomicMin is unable to find the location of the minimum.

I also tried to use the thrust library

__global__ void kernel (int m, int n, int h, int N, int *f, float heading, float *measurements, int *global_min, float *dev_MAD) 
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    float theta=heading*(PI/180);
    float fval = 0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta); 

    //Calculate Mean Absolute Difference
    if(idx < n && idy < m)
    {
        for(float g=0; g<N; g++)
        {
            float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
            *dev_MAD += abs(measurements[(int)g]-fval); 
        }
    }
    cuPrintf("%.2f \n",MAD);
}

And calling the kernel like this

kernel <<< dimGrid,dimBlock >>> (m, n, h, N, dev_results, heading, dev_measurements, global_min, dev_MAD);

thrust::device_ptr<float> dev_ptr(dev_MAD); 
thrust::device_ptr<float> min_pos = thrust::min_element(dev_ptr, dev_ptr + n*m);
int abs_pos = min_pos - dev_ptr;
float min_val=min_pos[0];

cudaMemcpy(&min_val, dev_MAD+abs_pos, sizeof(float), cudaMemcpyDeviceToHost);

// Print out the result
printf("Min=%.2f pos=%d\n",min_val,abs_pos);

But this program print out: Min=-207521258711807190000000000000000000000.00 pos=0

I've looked at many reduction examples, but it seems like in everyone they have the values stored in an array, and not in each individual thread.

So to the questions:

Is it possible to make the atomicMin function return the location?
Can anyone give me a hint on how to solved the problem with the thrust library?

Solution

For your Thrust code, you're writing to dev_MAD[0], yet computing results as though you've written to the entire array.

IIUC, you're trying to find the minimum value and corresponding location, you have the values as a variable in each thread but not stored in memory.

There are a couple of easy ways I can think to do this but both involve storing the values to memory and computing the minimum/location in a second pass.

Firstly, you could just use Thrust's min_element as you have already tried, but you would store the values to a device_vector in your kernel and then call thrust::min_element independently.

Secondly, you could save some memory space and bandwidth by computing the minimum/location within the threadblock first (and then use thrust::min_element afterwards). For this you could use CUB's reduction with a custom reduce operator (compare on value, datum is {value,index}).