I have the following algorithm:
__global__ void Update(int N, double* x, double* y, int* z, double* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        x[i] += y[i];
        if (y[i] >= 0.)
            out[z[i]] += x[i];  // unsafe: several threads may target the same out[z[i]]
        else
            out[z[i]] -= x[i];  // unsafe: several threads may target the same out[z[i]]
    }
}
It is important to note that out is smaller than x. Say x, y and z are always the same size, say 1000, and out is always smaller, say 100. z holds the index in out to which each element of x and y corresponds.
This is all fine except for the updates to out. There may be clashes across threads, since z does not contain only unique values and has duplicates. I therefore currently have the add and subtract implemented as atomic operations built on compare-and-swap. This is obviously expensive and means my kernel takes 5-10x longer to run.
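For reference, the kind of CAS-based double add I mean is essentially the standard loop from the CUDA C Programming Guide (the subtract case just passes -val):

__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // attempt to swap in old + val; retry if another thread got there first
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}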
I would like to reduce this cost; however, the only way I can think of doing so is for each thread to have its own version of out (which can be large, 10000+ entries, times 10000+ threads). This would mean I set up 10000 double[10000] arrays (perhaps in shared memory?), call my kernel, and then sum across these arrays, perhaps in another kernel. Surely there must be a more elegant way to do this?
It might be worth noting that x, y, z and out all reside in global memory. As my kernel (I have others like it) is very simple, I have decided not to copy pieces across to shared memory (nvvp shows roughly equal compute and memory utilisation for the kernel, so I suspect there is not much performance to be gained once you add the overhead of moving data from global to shared and back again; any thoughts?).
Method 1:
Build a set of "transactions". Since you only have one update per thread, you can easily build a fixed-size "transaction" record, one entry per thread. Suppose I have 7 threads (for simplicity of presentation) and some arbitrary number of entries in my out table. Let's suppose my 7 threads want to do 7 transactions like this:

thread ID (i):   0       1       2       3       4       5       6
z[i]:            2       3       4       4       3       2       3
x[i]:            1.5     0.5     1.0     0.5     0.1    -0.2    -0.1
"transaction":   2,1.5   3,0.5   4,1.0   4,0.5   3,0.1   2,-0.2  3,-0.1
Now do a sort_by_key on the transactions, to arrange them in order of z[i]:

sorted:  2,1.5  2,-0.2  3,0.5  3,-0.1  3,0.1  4,1.0  4,0.5
Now do a reduce_by_key operation on the transactions:

keys:    2    3    4
values:  1.3  0.5  1.5
Now update out according to the keys:
out[2] += 1.3
out[3] += 0.5
out[4] += 1.5
thrust and/or cub might be pre-built options for the sort and reduce operations.
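A sketch of how the whole pipeline might look with Thrust (illustrative names; keys and vals are the transaction arrays produced by a kernel like the one above):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>

void ApplyTransactions(thrust::device_vector<int>&    keys,  // z[i] per thread
                       thrust::device_vector<double>& vals,  // signed contributions
                       thrust::device_vector<double>& out)
{
    // 1. Sort transactions by destination index in out.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // 2. Sum all transactions that target the same index.
    thrust::device_vector<int>    ukeys(keys.size());
    thrust::device_vector<double> sums(keys.size());
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      ukeys.begin(), sums.begin());
    auto n = ends.first - ukeys.begin();

    // 3. out[ukeys[j]] += sums[j]; the keys are now unique, so no collisions.
    auto out_at = thrust::make_permutation_iterator(out.begin(), ukeys.begin());
    thrust::transform(out_at, out_at + n, sums.begin(), out_at,
                      thrust::plus<double>());
}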
Method 2:
As you say, you have arrays x, y, z, and out in global memory. If you are going to use z, which is a "mapping", repeatedly, you might want to rearrange (group) or sort your arrays in order of z:
index (i):   0     1     2     3     4     5     6     7
z[i]:        2     8     4     8     3     1     4     4
x[i]:        0.2   0.4   0.3   0.1  -0.1  -0.4   0.0   1.0

group by z[i]:

index (i):   0     1     2     3     4     5     6     7
z[i]:        1     2     3     4     4     4     8     8
x[i]:       -0.4   0.2  -0.1   0.3   0.0   1.0   0.4   0.1
This, or some variant of it, would allow you to avoid repeatedly doing the sorting operation from Method 1 (again, assuming you use the same "mapping" vector repeatedly).
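With Thrust, this one-time regrouping could be done in a single pass by sorting z and carrying x and y along through a zip iterator (a sketch; the function name is illustrative):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

void GroupByMapping(thrust::device_vector<int>&    z,
                    thrust::device_vector<double>& x,
                    thrust::device_vector<double>& y)
{
    // Sort the keys z and permute x and y into the same order in one pass.
    thrust::sort_by_key(z.begin(), z.end(),
                        thrust::make_zip_iterator(
                            thrust::make_tuple(x.begin(), y.begin())));
}

After this, equal z values are contiguous, so each iteration of Method 1 reduces to just the reduce_by_key and update steps.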