How to go from a sparse array representation to a dense one in CUDA

I'm a beginner at CUDA and trying to figure out the most efficient way to do something.

I have an array of values. I want to build an array that's essentially the number of times each value appears in the array. Is there an efficient algorithm to do this using CUDA?

For example, let's assume that the values range from 0-10. In reality I have negative values as well, but I need to ignore those. (EDIT: I've tried thrust::remove_if followed by thrust::reduce_by_key, but I'm looking for something that's more efficient in terms of ignoring the elements I don't care about. Almost like a thrust::reduce_by_key_if). The list is much much smaller than the range of the values (ie, the vast majority of values are outside the range I care about). I might have:

int32_t values[5] = {3, 5, 2, 5, 1, -1};

And I want to build the array:

int32_t result[10] = {0, 1, 1, 1, 0, 2, 0, 0, 0, 0};

Right now I'm doing it on the CPU mostly. I've tried sorting the list of indices it using thrust to improve memory caching performance, but the performance improvements there are marginal at best.

Any thoughts? Is there an elegant way to do this?

Solution

thrust::remove_if followed by thrust::reduce_by_key ended up being the best way to do it for my application.