Thrust: Stream compaction copying only first N valid elements

I have a const thrust vector of elements from which I would like to extract at most N elements that pass a predicate (in any order), where the thrust vector size and N are known at compile-time. In my specific case, my vector is 500k elements and N is 100k.

My initial thought was to use thrust::copy_if to get all elements that pass the predicate, then to use only the first N elements for my subsequent calculations. However, in that case I would have to allocate two vectors of 500k elements (one for the initial vector, and one for the output of copy_if) and I'd have to process every element.

As this is an operation I have to do many times and across several CUDA streams, I would like to know if there is a way to obtain the N output elements while minimizing the memory footprint required, and ideally, minimizing the number of elements that need to be processed (i.e. breaking the process once N valid elements have been found).

Solution

One possible method to perform a stream compaction operation is to perform a predicated prefix-sum followed by a conditional indexed copy. By breaking a "monolithic" operation into these 2 pieces, it becomes fairly easy to insert the desired limiting behavior on output size.

The prefix sum is a fairly involved operation. We will use thrust for that. The conditional indexed copy is fairly trivial, so we will write our own CUDA kernel for that, rather than try to wrestle with a thrust::copy_if operation to get the copy logic just right. This kernel is where we will insert the limiting behavior on the output size.

Here is a worked example:

$ cat t34.cu
#include <thrust/scan.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>

using namespace thrust::placeholders;

typedef int mt;

__global__ void my_copy(mt *d, int *i, mt *r, int limit, int size){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < size){
    if ((idx == 0) && (*i == 1) && (limit > 0))
      *r = *d;
    else if ((idx > 0) && (i[idx] > i[idx-1]) && (i[idx] <= limit)){
      r[i[idx]-1] = d[idx];}
  }
}

int main(){
  int rs = 3;
  mt d[] = {0, 1, 0, 2, 0, 3, 0, 4, 0, 5};
  int ds = sizeof(d)/sizeof(d[0]);
  thrust::device_vector<mt> data(d, d+ds);
  thrust::device_vector<int> idx(ds);
  thrust::device_vector<mt> result(rs);
  auto my_cmp = thrust::make_transform_iterator(data.begin(), 0+(_1>0));
  thrust::inclusive_scan(my_cmp, my_cmp+ds, idx.begin());
  my_copy<<<(ds+255)/256, 256>>>(thrust::raw_pointer_cast(data.data()), thrust::raw_pointer_cast(idx.data()), thrust::raw_pointer_cast(result.data()), rs, ds);
  thrust::host_vector<mt> h_result = result;
  thrust::copy_n(h_result.begin(), rs, std::ostream_iterator<mt>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -std=c++14 -o t34 t34.cu -arch=sm_52
$ ./t34
1,2,3,
$

(CUDA 11.0, Fedora 29, GTX 960)

Note that this code is provided for demonstration purposes. You should not assume that it is defect-free or suitable for any particular purpose. Use it at your own risk.

A bit of study with a profiler will show that the thrust::inclusive_scan operation does perform a cudaMalloc and cudaFree operation "under the hood". So even though we have pulled most of the allocations "out into the open" here, thrust apparently still needs to perform a single temporary allocation (of unknown size) to support the scan operation.

Responding to a question in the comments below. To understand this: 0+(_1>0), there are two things to note:

The general syntax is using thrust::placeholders. This capability of thrust allows us to write simple unary or binary functions inline, avoiding the need to use lambdas or write separate functors.
The reason for the 0+ is as follows. If we simply used (_1>0), then thrust would use as its unary function a boolean test of the item returned by dereferencing the iterator, compared to zero. The result of that comparison is a boolean, and if we leave it that way, the prefix sum will ultimately be computed using boolean arithmetic, which we do not want. We want the result of the boolean greater-than test (i.e. true/false) to be converted to an integer, so that the subsequent prefix sum gets performed using integer arithmetic. Prepending the (_1>0) boolean test with 0+ accomplishes that.