Tags: cuda, mutex, atomic, gpu-shared-memory

Shared memory mutex with CUDA - adding to a list of items


My problem is the following: I have an image in which I detect some points of interest using the GPU. The detection is a heavyweight test in terms of processing; however, only about 1 in 25 points passes the test on average. The final stage of the algorithm is to build up a list of the points. On the CPU this would be implemented as:

for (int y = 0; y < h; y++) {
    for (int x = 0; x < w; x++) {
        if (test_this_pixel(x, y))
            vector_of_coordinates.push_back(Vec2(x, y));
    }
}

On the GPU I have each CUDA block processing 16x16 pixels. The problem is that I need to do something special to eventually have a single consolidated list of points in global memory. At the moment I am trying to generate a local list of points in shared memory per block which eventually will be written to global memory. I am trying to avoid sending anything back to the CPU because there are more CUDA stages after this.
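
For reference, the grid is launched roughly like this (a sketch with illustrative buffer names; the 64-points-per-block and one-count-per-block sizing matches the kernel further down):

dim3 block(16, 16);                                  // one block per 16x16 tile
dim3 grid((w + block.x - 1) / block.x,
          (h + block.y - 1) / block.y);

RtmPoint *d_points;                                  // up to 64 points per block
int      *d_counts;                                  // one count per block
cudaMalloc(&d_points, grid.x * grid.y * 64 * sizeof(RtmPoint));
cudaMalloc(&d_counts, grid.x * grid.y * sizeof(int));

gpu_kernel_non_max_suppress<<<grid, block>>>(w, h, d_points, d_counts);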

I was expecting that I could use atomic operations to implement the push_back function on shared memory. However, I am unable to get this working. There are two issues. The first (and most annoying) is that I constantly run into the following compiler crash:

nvcc error : 'ptxas' died with status 0xC0000005 (ACCESS_VIOLATION)

when using atomic operations. It is hit or miss whether I can compile something. Does anyone know what causes this?

The following kernel will reproduce the error:

__global__ void gpu_kernel(int w, int h, RtmPoint *pPoints, int *pCounts)
{
    __shared__ unsigned int test;
    atomicInc(&test, 1000);
}

Secondly, my code which includes a mutex lock on shared memory hangs the GPU and I don't understand why:

__device__ void lock(unsigned int *pmutex)
{
    // spin until the mutex is atomically swapped from 0 (free) to 1 (held)
    while(atomicCAS(pmutex, 0, 1) != 0);
}

__device__ void unlock(unsigned int *pmutex)
{
    // release the mutex
    atomicExch(pmutex, 0);
}

__global__ void gpu_kernel_non_max_suppress(int w, int h, RtmPoint *pPoints, int *pCounts)
{
    __shared__ RtmPoint localPoints[64];
    __shared__ int localCount;
    __shared__ unsigned int mutex;

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    int threadid = threadIdx.y * blockDim.x + threadIdx.x;
    int blockid = blockIdx.y * gridDim.x + blockIdx.x;

    if(threadid==0)
    {
        localCount = 0;
        mutex = 0;
    }

    __syncthreads();

    if(x<w && y<h)
    {
        if(some_test_on_pixel(x,y))
        {
            RtmPoint point;
            point.x = x;
            point.y = y;

            // this is a local push_back operation
            lock(&mutex);
            if(localCount<64) // we should never get >64 points per block
                localPoints[localCount++] = point;
            unlock(&mutex);
        }
    }

    __syncthreads();

    if(threadid==0)
        pCounts[blockid] = localCount;
    if(threadid<localCount)
        pPoints[blockid * 64 + threadid] = localPoints[threadid];
}

In the example code at this site, the author successfully uses atomic operations on shared memory, so I am confused as to why my case does not work. If I comment out the lock and unlock lines, the code runs, but obviously the list is then built incorrectly.

I would appreciate some advice on why this problem is happening, and also on whether there is a better way to achieve the goal, since I am in any case concerned about the performance cost of atomic operations and mutex locks.


Solution

  • I suggest using a prefix sum (scan) to implement that part and increase parallelism. To do that you need to use a shared array. The scan below computes an inclusive running sum, so it will turn an array such as (1,1,0,1) into (1,2,2,3); each emitting thread's write index is then its scan value minus one, and the last element is the total number of points found by the block.

    // NUMTHREADS = threads per block (compile-time constant),
    // TID = flattened thread index within the block
    __shared__ unsigned int vector[NUMTHREADS];
    
    ....
    
    bool emit = (x < w && y < h);
         emit = emit && some_test_on_pixel(x, y);
    __syncthreads();
    unsigned int total = scan(emit ? 1u : 0u, vector);   // inclusive prefix sum
    if (emit) {
        // vector[TID] - 1 is this thread's write slot within the block
        pPoints[blockid * 64 + vector[TID] - 1] = point;
    }
    if (TID == 0)
        pCounts[blockid] = total;
    

    prefix-sum example:

    // Hillis-Steele inclusive scan over one block. numWorkers must be a
    // compile-time constant equal to the number of participating threads,
    // and the block is assumed to be 1D here (threadIdx.x is the thread index);
    // for a 2D block, use the flattened thread index instead.
    template <typename T>
    __device__ T scan(T mark, T *output) {
    #define GET_OUT (pout?output:values)
    #define GET_INP (pin?output:values)
      __shared__ T values[numWorkers];
      int pout = 0, pin = 1;
      int tid = threadIdx.x;
    
      values[tid] = mark;
    
      __syncthreads();
    
      for (int offset = 1; offset < numWorkers; offset *= 2) {
        pout = 1 - pout; pin = 1 - pout;   // swap the ping-pong buffers
        __syncthreads();
        if (tid >= offset) {
          GET_OUT[tid] = GET_INP[tid - offset] + GET_INP[tid];
        }
        else {
          GET_OUT[tid] = GET_INP[tid];
        }
        __syncthreads();
      }
    
      // make sure the final result ends up in `output`
      if (!pout)
        output[tid] = values[tid];
    
      __syncthreads();
    
      return output[numWorkers - 1];   // total of all marks in the block
    
    #undef GET_OUT
    #undef GET_INP
    }
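
    If you ultimately want one consolidated global list instead of per-block lists, a common follow-up is to let one thread per block reserve a contiguous range with a single global atomicAdd and have the emitting threads write into it. This is only a sketch: d_globalCount is an assumed unsigned int counter in global memory, zeroed before the launch, and emit, vector and TID are as in the snippet above.

    __shared__ unsigned int base;
    
    unsigned int total = scan(emit ? 1u : 0u, vector);   // as above
    
    if (TID == 0)
        base = atomicAdd(d_globalCount, total);          // reserve `total` slots
    __syncthreads();
    
    if (emit)
        pPoints[base + vector[TID] - 1] = point;         // gap-free consolidated list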