algorithm parallel-processing cuda stream-compaction

CUDA stream compaction algorithm

I'm trying to construct a parallel algorithm with CUDA that takes an array of integers and removes all of the 0's with or without keeping the order.

Example:

Global Memory: {0, 0, 0, 0, 14, 0, 0, 17, 0, 0, 0, 0, 13}

Host Memory Result: {17, 13, 14, 0, 0, ...}

The simplest way is to use the host to remove the 0's in O(n) time. But considering I have around 1000 elements, it probably will be faster to leave everything on the GPU and condense it first, before sending it.

The preferred method would be to create an on-device stack, such that each thread can pop and push (in any order) onto or off of the stack. However, I don't think CUDA has an implementation of this.

An equivalent (but much slower) method would be to keep attempting to write, until all threads have finished writing:

kernalRemoveSpacing(int * array, int * outArray, int arraySize) {
    if (array[threadId.x] == 0)
        return;

    for (int i = 0; i < arraySize; i++) {

         array = arr[threadId.x];

         __threadfence();

         // If we were the lucky thread we won! 
         // kill the thread and continue re-reincarnated in a different thread
         if (array[i] == arr[threadId.x])
             return;
    }
}

This method has only benefit in that we would perform in O(f(x)) time, where f(x) is the average number of non-zero values there are in an array (f(x) ~= ln(n) for my implementation, thus O(ln(n)) time, but has a high O constant)

Finally, a sort algorithm such as quicksort or mergesort would also solve the problem, and does in fact run in O(ln(n)) relative time. I think there might be an algorithm faster than this even, as we do not need to waste time ordering (swapping) zero-zero element pairs, and non-zero non-zero element pairs (the order does not need to be kept).

So I'm not quite sure which method would be the fastest, and I still think there's a better way of handling this. Any suggestions?

Solution

What you are asking for is a classic parallel algorithm called stream compaction¹.

If Thrust is an option, you may simply use thrust::copy_if. This is a stable algorithm, it preserves relative order of all elements.

Rough sketch:

#include <thrust/copy.h>

template<typename T>
struct is_non_zero {
    __host__ __device__
    auto operator()(T x) const -> bool {
        return x != 0;
    }
};

// ... your input and output vectors here

thrust::copy_if(input.begin(), input.end(), output.begin(), is_non_zero<int>());

If Thrust is not an option, you may implement stream compaction yourself (there is plenty of literature on the topic). It's a fun and reasonably simple exercise, while also being a basic building block for more complex parallel primitives.

⁽¹⁾ Strictly speaking, it's not exactly stream compaction in the traditional sense, as stream compaction is traditionally a stable algorithm but your requirements do not include stability. This relaxed requirement could perhaps lead to a more efficient implementation?