I have the following algorithm:
__global__ void Update(int N, double* x, double* y, int* z, double* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        x[i] += y[i];
        if (y[i] >= 0.)
            out[z[i]] += x[i];  // unsafe: several threads may target the same out[z[i]]
        else
            out[z[i]] -= x[i];  // unsafe: several threads may target the same out[z[i]]
    }
}
It is important to note that out is smaller than x. Say x, y and z are always the same size, say 1000, and out is always smaller, say 100. z holds the index in out to which each element of x and y corresponds.
This is all fine except for the updates to out. There may be clashes across threads, since z does not contain only unique values and has duplicates. I therefore currently have the add and subtract implemented as atomic operations built on compare-and-swap. This is obviously expensive and means my kernel takes 5-10x longer to run.
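For reference, the kind of CAS-based double add I mean is essentially the standard loop from the CUDA C Programming Guide (the subtract case just passes -val):

__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // attempt to swap in old + val; retry if another thread got there first
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}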
I would like to reduce this cost; however, the only way I can think of doing so is for each thread to have its own version of out (which can be large, 10000+ entries, times 10000+ threads). This would mean I set up 10000 double[10000] arrays (perhaps in shared memory?), call my kernel, and then sum across these arrays, perhaps in another kernel. Surely there must be a more elegant way to do this?
It might be worth noting that x, y, z and out all reside in global memory. As my kernel (I have others like it) is very simple, I have decided not to copy pieces across to shared memory (nvvp shows roughly equal compute and memory utilisation for the kernel, so I suspect there is not much performance to be gained once you add the overhead of moving data from global to shared and back again; any thoughts?).
Method 1:
Build a set of "transactions". Since you only have one update per thread, you can easily build a fixed-size "transaction" record, one entry per thread. Suppose I have 7 threads (for simplicity of presentation) and some arbitrary number of entries in my out table. Let's suppose my 7 threads want to do 7 transactions like this:

thread ID (i):   0       1       2       3       4       5       6
z[i]:            2       3       4       4       3       2       3
x[i]:            1.5     0.5     1.0     0.5     0.1    -0.2    -0.1
"transaction":   2,1.5   3,0.5   4,1.0   4,0.5   3,0.1   2,-0.2  3,-0.1
Now do a sort_by_key on the transactions, to arrange them in order of z[i]:

sorted:  2,1.5  2,-0.2  3,0.5  3,-0.1  3,0.1  4,1.0  4,0.5
Now do a reduce_by_key operation on the transactions:

keys:    2    3    4
values:  1.3  0.5  1.5
Now update out according to the keys:
out[2] += 1.3
out[3] += 0.5
out[4] += 1.5
thrust and/or cub might be pre-built options for the sort and reduce operations.
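A sketch of how the whole pipeline might look with Thrust (illustrative names; keys and vals are the transaction arrays produced by a kernel like the one above):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>

void ApplyTransactions(thrust::device_vector<int>&    keys,  // z[i] per thread
                       thrust::device_vector<double>& vals,  // signed contributions
                       thrust::device_vector<double>& out)
{
    // 1. Sort transactions by destination index in out.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // 2. Sum all transactions that target the same index.
    thrust::device_vector<int>    ukeys(keys.size());
    thrust::device_vector<double> sums(keys.size());
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      ukeys.begin(), sums.begin());
    auto n = ends.first - ukeys.begin();

    // 3. out[ukeys[j]] += sums[j]; the keys are now unique, so no collisions.
    auto out_at = thrust::make_permutation_iterator(out.begin(), ukeys.begin());
    thrust::transform(out_at, out_at + n, sums.begin(), out_at,
                      thrust::plus<double>());
}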
Method 2:
As you say, you have arrays x, y, z, and out in global memory. If you are going to use z, which is a "mapping", repeatedly, you might want to rearrange (group) or sort your arrays in order of z:
index (i):   0     1     2     3     4     5     6     7
z[i]:        2     8     4     8     3     1     4     4
x[i]:        0.2   0.4   0.3   0.1  -0.1  -0.4   0.0   1.0

group by z[i]:

index (i):   0     1     2     3     4     5     6     7
z[i]:        1     2     3     4     4     4     8     8
x[i]:       -0.4   0.2  -0.1   0.3   0.0   1.0   0.4   0.1
This, or some variant of it, would allow you to avoid repeatedly doing the sorting operation from Method 1 (again, assuming you use the same "mapping" vector repeatedly).
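With Thrust, this one-time regrouping could be done in a single pass by sorting z and carrying x and y along through a zip iterator (a sketch; the function name is illustrative):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

void GroupByMapping(thrust::device_vector<int>&    z,
                    thrust::device_vector<double>& x,
                    thrust::device_vector<double>& y)
{
    // Sort the keys z and permute x and y into the same order in one pass.
    thrust::sort_by_key(z.begin(), z.end(),
                        thrust::make_zip_iterator(
                            thrust::make_tuple(x.begin(), y.begin())));
}

After this, equal z values are contiguous, so each iteration of Method 1 reduces to just the reduce_by_key and update steps.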