Due to some performance issues with the Thrust libraries (see this page for more details), I am planning on re-factoring a CUDA application to use CUB instead of Thrust. Specifically, to replace the thrust::sort_by_key and thrust::inclusive_scan calls). In a particular point in my application I need to sort 3 arrays by key. This is how I did this with thrust:
thrust::sort_by_key(key_iter, key_iter + numKeys, indices);
thrust::gather_wrapper(indices, indices + numKeys,
thrust::make_zip_iterator(thrust::make_tuple(values1Ptr, values2Ptr, values3Ptr)),
thrust::make_zip_iterator(thrust::make_tuple(valuesOut1Ptr, valuesOut2Ptr, valuesOut3Ptr))
);
where
key iter
is a thrust::device_ptr that points to the keys i want to sort byindices
point to a sequence (from 0 to numKeys-1) in device memoryvalues{1,2,3}Ptr
are device_ptrs to the values i want to sortvalues{1,2,3}OutPtr
are device_ptrs to the sorted valuesWith the CUB SortPairs function I can sort a single value buffer, but not all 3 in one shot. Problem is I don't see any CUB "gather-like" utilities. Suggestions?
EDIT:
I suppose I could implement my own gather kernel, but is there any better way to do this other than:
template <typename Index, typename Value>
__global__ void gather_kernel(const unsigned int N, const Index * map,
const Value * src, Value * dst)
{
unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N)
{
dst[i] = src[map[i]];
}
}
The non-coalesed loads and stores make me chringe, but it probably unavoidable without a known structure on map
.
It seems what you want to achieve depends on thrust::zip_iterator
. You could either
thrust::sort_by_key
by cub::DeviceRadixSort::SortPairs
and keep thrust::gather
, orvalues{1,2,3}
into array of structures before using cub::DeviceRadixSort::SortPairs
After reading the implementation of thrust::gather
,
$CUDA_HOME/include/thrust/system/detail/generic/gather.inl
you can see it is only a naive kernel like
__global__ gather(int* index, float* in, float* out, int len) {
int i=...;
if (i<len) { out[i] = in[index[i]]; }
}
Then I think your code above can be replaced by a single kernel without too much effort.
In this kernel, you could first use the CUB block-wize primitive cub::BlockRadixSort<...>::SortBlockedToStriped
to get the sorted indices stored in registers and then perform a naive re-order copy as thrust::gather
to fill values{1,2,3}Out
.
Using SortBlockedToStriped
rather than Sort
can do coalesced writing (not for reading though) when copying the values
.