
How to asynchronously copy a disjoint subset of an array from device to host with CUDA/Thrust?


I'm not sure if this is possible or not in an asynchronous fashion, but what I'd like to do is the following. Suppose I have the following array on device:

d_arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Also suppose I have a host array h_arr of size 3. Lastly, suppose I have the following array of pointers:

p_arr = [&d_arr[0], &d_arr[4], &d_arr[8]]

I would like to call an imaginary function

cudaMemcpyAsyncDisjoint(&d_arr[0], &h_arr[0], &p_arr[0], 3)

which then fills h_arr so that it becomes

h_arr = [0, 4, 8]

I want to do this asynchronously because my main concern is speed: I have a driver method that runs kernels in a loop and copies data back to the host at the end of each round.


Solution

  • If the strides between elements of p_arr are constant, then this is possible in a single operation with cudaMemcpy2DAsync.

    For varying strides, it's not possible in a single operation. Furthermore, the single-operation method (with constant stride) is not necessarily the fastest: cudaMemcpy2DAsync does not necessarily get close to the expected bus transfer speeds. For the fastest method, plus the ability to handle varying strides between the elements to be copied, the usual recommendation is to break this into two steps:

    1. Use something like thrust::gather (or thrust::copy with a permutation iterator) to collect all elements to be copied into a contiguous, temporary device buffer.
    2. Then use cudaMemcpyAsync to copy that buffer to the host.
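Both approaches can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation. It assumes the elements to copy are described by an index array (0, 4, 8) rather than the raw pointers in p_arr (a pointer can be turned into an index by subtracting &d_arr[0]), and that the host buffers are pinned with cudaMallocHost so the copy can actually overlap with other work. The helper name gather_to_host_async is hypothetical, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/gather.h>
#include <thrust/sequence.h>

// Hypothetical helper (not part of CUDA or Thrust): gathers d_src[d_idx[i]]
// into the contiguous buffer d_tmp, then queues a device-to-host copy of
// that buffer on the same stream.
cudaError_t gather_to_host_async(const thrust::device_vector<int>& d_src,
                                 const thrust::device_vector<int>& d_idx,
                                 thrust::device_vector<int>& d_tmp,
                                 int* h_dst, cudaStream_t stream)
{
    // Step 1: collect the scattered elements into contiguous device memory.
    thrust::gather(thrust::cuda::par.on(stream),
                   d_idx.begin(), d_idx.end(), d_src.begin(), d_tmp.begin());

    // Step 2: one contiguous asynchronous copy back to the host.
    return cudaMemcpyAsync(h_dst, thrust::raw_pointer_cast(d_tmp.data()),
                           d_idx.size() * sizeof(int),
                           cudaMemcpyDeviceToHost, stream);
}

int main()
{
    // d_arr = [0, 1, ..., 9], as in the question.
    thrust::device_vector<int> d_arr(10);
    thrust::sequence(d_arr.begin(), d_arr.end());

    // Assumption: indices instead of the pointer array p_arr.
    const std::vector<int> idx = {0, 4, 8};
    thrust::device_vector<int> d_idx(idx.begin(), idx.end());
    thrust::device_vector<int> d_tmp(idx.size());
    const size_t bytes = idx.size() * sizeof(int);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned host buffers, one per method, so the results can be compared.
    int* h_arr = nullptr;
    int* h_strided = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&h_arr), bytes);
    cudaMallocHost(reinterpret_cast<void**>(&h_strided), bytes);

    // Two-step method: works for arbitrary (varying) strides.
    gather_to_host_async(d_arr, d_idx, d_tmp, h_arr, stream);

    // Constant-stride method: treat d_arr as a 2D array with 3 rows, each
    // row 4 ints apart (source pitch), and copy 1 int (width) per row.
    cudaMemcpy2DAsync(h_strided, sizeof(int),
                      thrust::raw_pointer_cast(d_arr.data()), 4 * sizeof(int),
                      sizeof(int), idx.size(),
                      cudaMemcpyDeviceToHost, stream);

    // The host buffers are only valid after the stream has been synchronized.
    cudaStreamSynchronize(stream);

    for (size_t i = 0; i < idx.size(); ++i)
        assert(h_arr[i] == h_strided[i]);
    std::printf("h_arr = [%d, %d, %d]\n", h_arr[0], h_arr[1], h_arr[2]);

    cudaFreeHost(h_arr);
    cudaFreeHost(h_strided);
    cudaStreamDestroy(stream);
    return 0;
}
```

In a driver loop, d_idx, d_tmp, the pinned host buffer and the stream would normally be created once and reused each round, so that no allocation happens on the hot path. Depending on the Thrust version, thrust::gather launched with par.on(stream) may still block the host until the gather itself completes; newer Thrust releases provide a thrust::cuda::par_nosync policy that is meant to avoid this, but check the documentation for the version you have installed.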