Is there a scatter primitive in OpennCL?

I am implementing the split operation as described here: https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html. Currently I am try to implement the scatter step.

scatter simply performs an operation similar to OpenCL's shuffle operation (here: https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/shuffle.html) but on general work group (__local) arrays instead of vectors.

I tried to find the support for this operation in OpenCL docs but with no luck.

Is there native OpenCL support for this operation, and if not is there some paper or article where I can find an efficient algorithm to implement this operation?

A note that might be helpful, I am using AMD's implementation so if there is a solution provided by AMD specifically, it would be fine. However, a general solution is much better.

Here is what I trying to accomplish in code. This is an OpenCL function that will be later called by another kernel that implements radix sort:

void Split(__local uint local_keys[256], uint cur_bit, int local_id)
{
__local uint e[256];
__local uint f[256];
__local uint t[256];
__local uint d[256];

e[local_id] = !(local_keys[local_id] & (1 << cur_bit));
f[local_id] =  work_group_scan_exclusive_add(e[local_id]);

work_group_barrier(CLK_LOCAL_MEM_FENCE);

uint total_falses = e[255] + f[255];
t[local_id] = local_id - f[local_id] + total_falses;
d[local_id] = e[local_id] ? f[local_id] : t[local_id];

work_group_barrier(CLK_LOCAL_MEM_FENCE);

// TODO scatter .... (Please see last step  of the attached image)
.......
}

Solution

In my opinion, the only solution is to use a helper array. Special instructions like the one you point out, are only valid for arrays that can be hold in registers. They are not valid when you cross the local memory boundaries since then it needs work item syncronizations.

During your process you use 4 local uint vectors. Which means you need that amount of local memory allocated at the same time. Allocating therefore a helper array to do the shuffling should not use any more memory than what you have been already using so far.

NOTE: Compiler will realize that the scope of e,f,t dies after you get the value d calculated. So temp array should not use extra memory, because it can reuse the previous memory. You can also hint that to the compiler by using { }.

void Split(__local uint local_keys[256], uint cur_bit, int local_id)
{
__local uint e[256];
__local uint f[256];
__local uint t[256];
__local uint d[256];
__local uint temp[256];

e[local_id] = !(local_keys[local_id] & (1 << cur_bit));
f[local_id] =  work_group_scan_exclusive_add(e[local_id]);

work_group_barrier(CLK_LOCAL_MEM_FENCE);

uint total_falses = e[255] + f[255];
t[local_id] = local_id - f[local_id] + total_falses;
d[local_id] = e[local_id] ? f[local_id] : t[local_id];

temp[local_id] = local_keys[local_id];

work_group_barrier(CLK_LOCAL_MEM_FENCE);

local_keys[d[local_id]] = temp[local_id];
}