How do you prevent thrust::reduce_by_key from writing to pageable memory?

I am writing an application which uses several concurrent CUDA streams. My other streams are blocking when my thrust::reduce_by_key call appears to write to pageable memory. I think the returned value is the problem.

How do you prevent the return value from being written to pageable memory?

I will include code that demonstrates my attempted solution.


#include <thrust/system/cuda/vector.h>
#include <thrust/host_vector.h>
#include <thrust/pair.h>
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/random.h>

int main(void)
{
  int N = 20;
  thrust::default_random_engine rng;
  thrust::uniform_int_distribution<int> dist(10, 99);

  // initialize data
  thrust::device_vector<int> array(N);
  for (size_t i = 0; i < array.size(); i++)
    array[i] = dist(rng);

  // allocate storage for sums and indices
  thrust::device_vector<int> sums(N);
  thrust::device_vector<int> indices(N);

  // make a pinned memory location for the returned pair of iterators
  typedef thrust::device_vector<int>::iterator  dIter;
  thrust::pair<dIter, dIter>*  new_end;

  const unsigned int bytes =  sizeof(thrust::pair<dIter, dIter>);
  cudaMallocHost((void**)&new_end, bytes);

  for(int i = 0 ; i< 20; i++){   // you can see in the profiler each operator writes 4 bytes to pageable memory

        *new_end = thrust::reduce_by_key
            (thrust::make_counting_iterator(0),
             thrust::make_counting_iterator(N),
             array.begin(),
             indices.begin(),
             sums.begin(),
             thrust::equal_to<int>(),
             thrust::plus<int>());
  }
  std::cout << "done \n";
  return 0;
}

This is a picture of my profiler showing the copy from device to host pageable memory

Solution

I am writing an application which uses several concurrent CUDA streams. My other streams are blocking when my thrust::reduce_by_key appears to write to pageable memory

This blocking behaviour is not caused by a "write to pageable memory". It is cause by an explicit synchronization call. In general, as of the CUDA 10.1 (Thrust 1.9.4) release, all normal synchronous algorithms are blocking. You can confirm this yourself by examining an API trace with the profiler. However, you might be able to at least limit the scope of the blocking by launching the call into a stream, although I am too lazy to test whether this modifies the behaviour of cuda_cub::synchronize in a useful way or not.

How do you prevent the return value from being written to pageable memory?

Not that this is in any way related to your problem, but you can't. It is important to keep in mind that, contrary to what your original question asserted, thrust::reduce_by_key is not a kernel, it is host code which executes a series of operations, including copying the return value from device memory to a host stack variable. There is no programmer control over the internals, and obviously your attempt to use your own pinned memory value to accept a result passed by value is nonsensical and will have no effect.

As suggested in comments, if you need the level of granularity of control of internals of the operation that your question suggests, then thrust is the wrong choice. Use cub::device::reduce_by_key -- that is the same algorithm implementation thrust uses, but you get explicit control over scratch memory, synchronization, streams, and how to access the result of the call. That is, however, not for beginners.