cuda, numba

Is cuda.to_device asynchronous?


Does cuda.to_device use the same stream as kernel launches?

It seems that memcpy is synchronous (with respect to the host).

from numba import cuda
import numpy as np

A = np.ones((10000, 10000))
%timeit cuda.to_device(A)
188 ms ± 5.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit cuda.synchronize()
14.5 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n1 cuda.to_device(A)
cuda.synchronize()
82.6 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

If cuda.to_device is synchronous, why does synchronizing afterwards take longer than 14.5 µs? (Does "synchronous with respect to the host" not mean the GPU is done?)


The results are similar if I explicitly provide a stream.

stream = cuda.stream()
%timeit cuda.to_device(A, stream=stream)
188 ms ± 4.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n1 cuda.to_device(A, stream=stream)
cuda.synchronize()
82.9 µs ± 6.66 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

"To enqueue the transfer to a stream" makes me think the work of transferring the data is delegated to the stream, in which case simply calling cuda.to_device should be faster (since it would return immediatly).

%%timeit -n1 cuda.synchronize()
cuda.to_device(A, stream=stream)
186 ms ± 30 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Edit: I launched many kernels to confirm that to_device is using the same stream.

import time

# matrix_multiplication, blocks_per_grid, threads_per_block and the
# A_gpu/B_gpu/C_gpu device arrays are defined earlier.
start = time.time()

for i in range(1000):
  matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)

print(f'launched kernels: {time.time() - start}')
cuda.to_device(A)  # takes a while, since it waits for the queue to free up
print(f'transferred: {time.time() - start}')
cuda.synchronize()
print(f'synchronized: {time.time() - start}')  # almost the same time, meaning cuda.to_device waited for the other kernels to finish
launched kernels: 0.16392898559570312
transferred: 29.859858512878418
synchronized: 29.860819101333618

Edit 2: There is a staging buffer between the CPU and GPU. to_device returns after sending the data to the staging buffer, and the additional latency comes from the staging buffer sending the data on to the GPU. I'm not sure why the staged → GPU step (82.6 µs − 14.5 µs) is so much faster than the CPU → staged step (188 ms).
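
One quick way to probe this (a minimal sketch; it assumes the ~800 MB array can be page-locked on this machine, and timings will differ elsewhere) is to pin A with cuda.pinned, so the driver can DMA directly from the host array without the staging copy, and time the same blocking call:

from numba import cuda
import numpy as np
import time

A = np.ones((10000, 10000))  # ~800 MB of float64

# Temporarily page-lock A so the driver can DMA directly from it,
# skipping the pageable -> staging-buffer copy.
with cuda.pinned(A):
    t0 = time.perf_counter()
    cuda.to_device(A)   # still blocking (no stream), but no staging copy
    print(f'pinned to_device:   {time.perf_counter() - t0:.3f} s')

# The same transfer from ordinary pageable memory, staged by the driver.
t0 = time.perf_counter()
cuda.to_device(A)
print(f'pageable to_device: {time.perf_counter() - t0:.3f} s')

If the pinned call comes out much cheaper than 188 ms, most of that time is the host-side copy into the staging buffer rather than the transfer over the bus itself.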


Solution

  • In CUDA, explicitly providing a stream is not sufficient to make a transfer asynchronous.

    Generally speaking, you also need to use pinned memory. If the transfer is between a device allocation and a pinned host allocation, and you specify a non-null stream, the transfer should meet the requirements to become asynchronous (a short sketch follows after this answer).

    You can find more info and examples here.

    "in which case simply calling cuda.to_device should be faster."

    Let's make sure we have clarity there for future readers. The transfer activity itself is unlikely to be faster simply because you enqueued it in a stream. What you are suggesting will be faster is the apparent time the call takes from an API perspective. A properly set up async transfer (using a created stream, to or from pinned memory) should only appear to take the time needed to put the operation into the stream, which may not include the copy time. Conversely, if you don't meet the requirements for an async transfer, the API call itself will take the actual transfer time.

    The actual transfer, whenever it may occur, is not likely to be any quicker just because you did or didn't put it into a stream.
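
    For illustration, here is a minimal sketch of the difference (array size and dtype are taken from the question; exact numbers will vary by machine). With a pinned host allocation and a created stream, cuda.to_device should return almost immediately, and the copy cost only shows up at cuda.synchronize():

    from numba import cuda
    import numpy as np
    import time

    stream = cuda.stream()

    # Ordinary pageable host memory: the copy is staged by the driver and
    # the call blocks even when a stream is passed.
    pageable = np.ones((10000, 10000))

    # Pinned (page-locked) host memory: combined with a created stream this
    # meets the requirements for an asynchronous transfer.
    pinned = cuda.pinned_array((10000, 10000), dtype=np.float64)
    pinned[:] = 1.0

    def timed_copy(host_array):
        cuda.synchronize()
        t0 = time.perf_counter()
        d_arr = cuda.to_device(host_array, stream=stream)  # keep a reference while the copy may be in flight
        t_api = time.perf_counter() - t0      # how long the API call appears to take
        cuda.synchronize()
        t_total = time.perf_counter() - t0    # time until the copy has actually finished
        return t_api, t_total

    print('pageable:', timed_copy(pageable))  # API time ~ total time
    print('pinned:  ', timed_copy(pinned))    # API time should be much smaller than total time

    Either way the data still has to cross the bus; the asynchronous version only changes whether you wait for it inside the API call or at the synchronize.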