How to gather rows from a matrix by indices list using CUDA Thrust

This is seemingly a simple problem but I just can’t figure out an elegant way to do this with CUDA Thrust.

I have a two dimensional matrix NxM and a vector of desired row indices of size L that is a subset of all rows(i.e. L < N) and is not regular (basically an irregular list like, 7,11,13,205,... etc.). The matrix is stored by rows in a thrust device vector. The array of indices is a device vector as well. Here are my two questions:

What is the most efficient way to copy the desired rows from the original NxM matrix forming a new matrix LxM?
Is it possible to create an iterator for the original NxM matrix that would dereference to only elements that belong to the desired rows?

Thank you very much for your help.

Solution

What you are asking about seems like a pretty straight forward stream compaction problem, and there isn't any particular problem doing it with thrust, but there are a couple of twists. In order to select the rows to copy, you need to have an stencil or key that the stream compaction algorithm can use. That needs to be constructed by a search or select operation using your list of rows to copy.

One example procedure to do this would go something like this:

Construct an iterator which returns the row number of any entry in the input matrix. Thrust has a very useful counting_iterator and transform_iterator which can be combined to do this
Perform a search of that row number iterator to find which entries match the list of rows to copy. thrust::binary search can be used for this. The search yields the stencil for the stream compaction operation
Use thrust::copy_if to perform the stream compaction on the input matrix with the stencil.

It sounds like a lot of work and intermediate steps, but the counting and transformation iterators don't actually produce any intermediate device vectors. The only intermediate storage required is the stencil array, which can be a boolean (so m*n bytes).

A full example in code:

#include <thrust/copy.h>
#include <thrust/binary_search.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/device_vector.h>
#include <cstdio>

struct div_functor : public thrust::unary_function<int,int>
{
    int m;
    div_functor(int _m) : m(_m) {};

    __host__ __device__
    int operator()(int x) const
    {
        return x / m;
    }
};

struct is_true
{
    __host__ __device__
    bool operator()(bool x) { return x; }
};


int main(void)
{

    // dimensions of the problem
    const int m=20, n=5, l=4;

    // Counting iterator for generating sequential indices

    // Sample matrix containing 0...(m*n)
    thrust::counting_iterator<float> indices(0.f);
    thrust::device_vector<float> in_matrix(m*n);
    thrust::copy(indices, indices+(m*n), in_matrix.begin());

    // device vector contain rows to select
    thrust::device_vector<int> select(l);
    select[0] = 1;
    select[1] = 4;
    select[2] = 9;
    select[3] = 16;

    // construct device iterator supplying row numbers via a functor
    typedef thrust::counting_iterator<int> counter;
    typedef thrust::transform_iterator<div_functor, counter> rowIterator;
    rowIterator rows_begin = thrust::make_transform_iterator(thrust::make_counting_iterator(0), div_functor(n));
    rowIterator rows_end = rows_begin + (m*n);

    // constructor a stencil array which indicates which entries will be copied
    thrust::device_vector<bool> docopy(m*n);
    thrust::binary_search(select.begin(), select.end(), rows_begin, rows_end, docopy.begin());

    // use stream compaction on the matrix with the stencil array
    thrust::device_vector<float> out_matrix(l*n);
    thrust::copy_if(in_matrix.begin(), in_matrix.end(), docopy.begin(), out_matrix.begin(), is_true());

    for(int i=0; i<(l*n); i++) {
        float val = out_matrix[i];
        printf("%i %f\n", i, val);
    }
}

(usual disclaimer: use at your own risk)

About the only comment I would make is that the predicate to the copy_if call feels a bit redundant given we have already a binary stencil that could be used directly, but there doesn't seem to be a variant of the compaction algorithms which can operate on a binary stencil directly. Similarly, I could not think of a sensible way to use the list of rows directly in the stream compaction call. There might well be a more efficient way to do this with thrust, but this should at least get you started.

From your comment, it seems that space is tight and the additional memory overhead of the binary search and stencil creation is prohibitive for your application. In that case I would follow the advice I offered in a comment to Roger Dahl's answer, and use a custom copy kernel instead. Thrust device vectors can be cast to a pointer you can pass directly to a kernel (thrust::raw_pointer_cast), so it need not interfere with your existing thrust code. I would suggest using a block of threads per row to copy, that allows coalescing of reads and writes and should perform a lot better than using thrust::copy for each row. A very simple implementation might look something like this (reusing most of my thrust example):

#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/device_vector.h>
#include <cstdio>

__global__ 
void rowcopykernel(const float *in, float *out, const int *list, const int m, const int n, const int l)
{
    __shared__ const float * inrowp; 
    __shared__ float * outrowp;

    if (threadIdx.x == 0) {
        inrowp = (blockIdx.x < l) ? in + (n*list[blockIdx.x]) : 0;
        outrowp = out + (n*blockIdx.x);
    }
    __syncthreads();

    for(int i=threadIdx.x; (inrowp != 0) && (i<n); i+=blockDim.x) {
        *(outrowp+i) = *(inrowp+i);
    }
}

int main(void)
{
    // dimensions of the problem
    const int m=20, n=5, l=4;

    // Sample matrix containing 0...(m*n)
    thrust::counting_iterator<float> indices(0.f);
    thrust::device_vector<float> in_matrix(m*n);
    thrust::copy(indices, indices+(m*n), in_matrix.begin());

    // device vector contain rows to select
    thrust::device_vector<int> select(l);
    select[0] = 1;
    select[1] = 4;
    select[2] = 9;
    select[3] = 16;

    // Output matrix
    thrust::device_vector<float> out_matrix(l*n);

    // raw pointer to thrust vectors
    int * selp = thrust::raw_pointer_cast(&select[0]);
    float * inp = thrust::raw_pointer_cast(&in_matrix[0]);
    float * outp = thrust::raw_pointer_cast(&out_matrix[0]);

    dim3 blockdim = dim3(128);
    dim3 griddim = dim3(l);
    rowcopykernel<<<griddim,blockdim>>>(inp, outp, selp, m, n, l);

    for(int i=0; i<(l*n); i++) {
        float val = out_matrix[i];
        printf("%i %f\n", i, val);
    }
}

(standard disclaimer: use at your own risk).

The execution parameter selection could be made fancier, but otherwise that should be about all that is required. If your rows are very small, you might want to investigate using a warp per row rather than a block (so one block copies several rows). If you have more than 65535 output rows, then you will need to either use a 2D grid, or modify the code to have each block do multiple rows. But, as with the thrust based solution about, this should get you started.