Is sort_by_key in thrust a blocking call?

I repeatedly enqueue a sequence of kernels:

for 1..100:
    for 1..10000:
        // Enqueue GPU kernels
        Kernel 1 - update each element of array  
        Kernel 2 - sort array  
        Kernel 3 - operate on array  
    end
    // run some CPU code
    output "Waiting for GPU to finish"
    // copy from device to host
    cudaMemcpy ... D2H(array)
end

Kernel 3 is of order O(N^2) so is by far the slowest of all. For Kernel 2 I use thrust::sort_by_key directly on the device:

thrust::device_ptr<unsigned int> key(dKey);
thrust::device_ptr<unsigned int> value(dValue);
thrust::sort_by_key(key,key+N,value);

It seems that this call to thrust is blocking, as the CPU code only gets executed once the inner loop has finished. I see this because if I remove the call to sort_by_key, the host code (correctly) outputs the "Waiting" string before the inner loop finishes, while it does not if I run the sort.

Is there a way to call thrust::sort_by_key asynchronously?

Solution

First of all consider that there is a kernel launch queue, which can hold only so many pending launches. Once the launch queue is full, additional kernel launches, of any kind are blocking. The host thread will not proceed (beyond those launch requests) until empty queue slots become available. I'm pretty sure 10000 iterations of 3 kernel launches will fill this queue before it has reached 10000 iterations. So there will be some latency (I think) with any sort of non-trivial kernel launches if you are launching 30000 of them in sequence. (eventually, however, when all kernels are added to the queue because some have already completed, then you would see the "waiting..." message, before all kernels have actually completed, if there were no other blocking behavior.)
thrust::sort_by_key requires temporary storage (of a size approximately equal to your data set size). This temporary storage is allocated, each time you use it, via a cudaMalloc operation, under the hood. This cudaMalloc operation is blocking. When cudaMalloc is launched from a host thread, it waits for a gap in kernel activity before it can proceed.

To work around item 2, it seems there might be at least 2 possible approaches:

Provide a thrust custom allocator. Depending on the characteristics of this allocator, you might be able to eliminate the blocking cudaMalloc behavior. (but see discussion below)
Use cub SortPairs. The advantage here (as I see it - your example is incomplete) is that you can do the allocation once (assuming you know the worst-case temp storage size throughout the loop iterations) and eliminate the need to do a temporary memory allocation within your loop.

The thrust method (1, above) as far as I know, will still effectively do some kind of temporary allocation/free step at each iteration, even if you supply a custom allocator. If you have a well-designed custom allocator, it might be that this is almost a "no-op" however. The cub method appears to have the drawback of needing to know the max size (in order to completely eliminate the need for an allocation/free step), but I contend the same requirement would be in place for a thrust custom allocator. Otherwise, if you needed to allocate more memory at some point, the custom allocator is effectively going to have to do something like a cudaMalloc, which will throw a wrench in the works.