Tags: c++, cuda, thrust

Efficient partial reductions given arrays of elements, offsets to and lengths of sublists


For my application I have to handle a bunch of objects (let's say ints) that subsequently get divided and sorted into smaller buckets. To this end, I store the elements in a single contiguous array

arr = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14...}

and the information about the buckets (sublists) is given by offsets to the first element in the respective bucket and the lengths of the sublists.

So, for instance, given

offsets = {0,3,8,..}
sublist_lengths = {3,5,2,...}

would result in the following splits:

0 1 2 || 3 4 5 6 7 || 8 9 || ...

What I am looking for is a somewhat general and efficient way to run algorithms, like reductions, on the buckets only, using either custom kernels or the Thrust library. Summing the buckets should give:

3 || 25 || 17 || ...

What I've come up with:

  • option 1: custom kernels require quite a bit of tinkering: copies into shared memory, a proper choice of block and grid sizes, and my own implementation of the algorithms, like scan, reduce, etc. Also, every single operation would require its own custom kernel. In general it is clear to me how to do this, but after having used Thrust for the last couple of days I have the impression that there might be a smarter way

  • option 2: generate an array of keys from the offsets ({0,0,0,1,1,1,1,1,2,2,3,...} in the above example) and use thrust::reduce_by_key (see the sketch after this list). I don't like the extra key list generation, though.

  • option 3: use thrust::transform_iterator together with thrust::counting_iterator to generate the above key list on the fly. Unfortunately, I can't come up with an implementation that doesn't require incrementing an index into the offset list on the device, which defeats parallelism.
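For concreteness, here is a minimal sketch of option 2 with the example data from above (an editor's illustration, not code from the question; the variable names are assumptions). The key list is built by scattering a 1 at each bucket start and running an inclusive scan; thrust::reduce_by_key then produces one sum per bucket. Note that sublist_lengths is not even needed for this, since the offsets alone determine the keys:

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/scatter.h>
#include <thrust/scan.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <iostream>

int main()
{
    // elements 0..9, buckets as in the example above
    thrust::device_vector<int> arr(10);
    thrust::sequence(arr.begin(), arr.end());          // {0,1,...,9}

    thrust::device_vector<int> offsets(3);
    offsets[0] = 0; offsets[1] = 3; offsets[2] = 8;    // bucket starts

    // build the key list {0,0,0,1,1,1,1,1,2,2}:
    // scatter a 1 at every bucket start except the first, then scan
    thrust::device_vector<int> keys(arr.size(), 0);
    thrust::scatter(thrust::constant_iterator<int>(1),
                    thrust::constant_iterator<int>(1) + (offsets.size() - 1),
                    offsets.begin() + 1,
                    keys.begin());
    thrust::inclusive_scan(keys.begin(), keys.end(), keys.begin());

    // segmented sum: one output value per bucket
    thrust::device_vector<int> sums(offsets.size());
    thrust::reduce_by_key(keys.begin(), keys.end(),
                          arr.begin(),
                          thrust::make_discard_iterator(),  // keys output unused
                          sums.begin());

    for (size_t i = 0; i < sums.size(); ++i)
        std::cout << sums[i] << " ";                   // expected: 3 25 17
    std::cout << std::endl;
}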

What would be the most sane way to implement this?


Solution

  • Within Thrust, I can't think of a better solution than Option 2. The performance will not be terrible, but it's certainly not optimal.

    Your data structure bears similarity to the Compressed Sparse Row (CSR) format for storing sparse matrices, so you could use techniques developed for computing sparse matrix-vector multiplies (SpMV) for such matrices if you want better performance. Note that the "offsets" array of the CSR format has length (N+1) for a matrix with N rows (i.e. buckets in your case) where the last offset value is the length of arr. The CSR SpMV code in Cusp is a bit convoluted, but it serves as a good starting point for your kernel. Simply remove any reference to Aj or x from the code and pass offsets and arr into the Ap and Av arguments respectively.
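    To illustrate the CSR-like approach, below is a hedged sketch of a custom segmented-reduction kernel in that spirit (a simplification, not the actual Cusp code: one block per bucket with a shared-memory tree reduction, whereas Cusp's vectorized kernel assigns a warp per row; bucket_sum, BLOCK_SIZE and the other names are this sketch's assumptions). It uses the (N+1)-length CSR-style offsets array described above, so no separate lengths array is needed:

#include <cstdio>

#define BLOCK_SIZE 256

// One thread block per bucket: threads stride over the bucket's elements,
// accumulate privately, then combine via a shared-memory tree reduction.
__global__ void bucket_sum(const int* __restrict__ arr,
                           const int* __restrict__ offsets, // length num_buckets + 1
                           int* __restrict__ sums,
                           int num_buckets)
{
    __shared__ int sdata[BLOCK_SIZE];

    int bucket = blockIdx.x;
    if (bucket >= num_buckets) return;

    int begin = offsets[bucket];
    int end   = offsets[bucket + 1];   // CSR-style: last offset is length of arr

    // strided accumulation within the bucket
    int acc = 0;
    for (int i = begin + threadIdx.x; i < end; i += BLOCK_SIZE)
        acc += arr[i];
    sdata[threadIdx.x] = acc;
    __syncthreads();

    // shared-memory tree reduction (BLOCK_SIZE is a power of two)
    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        sums[bucket] = sdata[0];
}

int main()
{
    const int num_buckets = 3;
    int h_arr[10]    = {0,1,2,3,4,5,6,7,8,9};
    int h_offsets[4] = {0,3,8,10};             // CSR-style offsets, length N+1

    int *d_arr, *d_offsets, *d_sums;           // error checking omitted for brevity
    cudaMalloc(&d_arr,     sizeof(h_arr));
    cudaMalloc(&d_offsets, sizeof(h_offsets));
    cudaMalloc(&d_sums,    num_buckets * sizeof(int));
    cudaMemcpy(d_arr, h_arr, sizeof(h_arr), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, h_offsets, sizeof(h_offsets), cudaMemcpyHostToDevice);

    bucket_sum<<<num_buckets, BLOCK_SIZE>>>(d_arr, d_offsets, d_sums, num_buckets);

    int h_sums[num_buckets];
    cudaMemcpy(h_sums, d_sums, sizeof(h_sums), cudaMemcpyDeviceToHost);
    printf("%d %d %d\n", h_sums[0], h_sums[1], h_sums[2]);  // expected: 3 25 17

    cudaFree(d_arr); cudaFree(d_offsets); cudaFree(d_sums);
}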