
Asynchronous GPU memory transfer with cupy


Is it possible to asynchronously transfer memory from/to GPU with cupy (or chainer)?

I'm training a relatively small network on very large data that does not fit into GPU memory. The data must be kept in CPU memory and fed to the GPU one minibatch at a time.

The memory transfer time is the dominant bottleneck of this application. I think asynchronous memory transfer would solve the problem, i.e. while one minibatch is being computed, the next minibatch is transferred to the GPU in the background.

I'm wondering whether this is possible with the cupy.cuda.Stream class, but I have no idea yet. I would appreciate any comments or advice.

EDIT: I thought the following code would perform the memory transfers asynchronously, but it does not.

import numpy as np
import cupy as cp

a_cpu = np.ones((10000, 10000), dtype=np.float32)
b_cpu = np.ones((10000, 10000), dtype=np.float32)

a_stream = cp.cuda.Stream(non_blocking=True)
b_stream = cp.cuda.Stream(non_blocking=True)

a_gpu = cp.empty_like(a_cpu)
b_gpu = cp.empty_like(b_cpu)

a_gpu.set(a_cpu, stream=a_stream)
b_gpu.set(b_cpu, stream=b_stream)

# This should start before b_gpu.set() is finished.
a_gpu *= 2

nvvp shows that the memory transfers take place sequentially.


Solution

  • I found a solution by diving into the chainer source code.

    The essential point seems to be to construct the np.ndarray on top of a pinned (page-locked) host memory buffer: CUDA can only overlap a host-to-device copy with other work when the host memory is pinned; transfers from ordinary pageable memory are effectively synchronous.

    import numpy as np
    import cupy as cp

    def pinned_array(array):
        # allocate pinned (page-locked) host memory and wrap it in an ndarray
        mem = cp.cuda.alloc_pinned_memory(array.nbytes)
        src = np.frombuffer(
                    mem, array.dtype, array.size).reshape(array.shape)
        src[...] = array
        return src
    
    a_cpu = np.ones((10000, 10000), dtype=np.float32)
    b_cpu = np.ones((10000, 10000), dtype=np.float32)
    # np.ndarray with pinned memory
    a_cpu = pinned_array(a_cpu)
    b_cpu = pinned_array(b_cpu)
    
    a_stream = cp.cuda.Stream(non_blocking=True)
    b_stream = cp.cuda.Stream(non_blocking=True)
    
    a_gpu = cp.empty_like(a_cpu)
    b_gpu = cp.empty_like(b_cpu)
    
    a_gpu.set(a_cpu, stream=a_stream)
    b_gpu.set(b_cpu, stream=b_stream)
    
    # wait until a_cpu has been copied into a_gpu
    a_stream.synchronize()
    # this line runs in parallel with b_gpu.set()
    a_gpu *= 2
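
    For the training loop described in the question, this extends naturally to a double-buffered pipeline: while the GPU computes on one buffer, the next minibatch is copied into the other. Below is a minimal sketch of that scheduling logic; `transfer`, `wait`, and `compute` are hypothetical callables standing in for `gpu_buf[slot].set(batch, stream=streams[slot])`, `streams[slot].synchronize()`, and the minibatch computation respectively, so the overlap structure can be shown (and tested) independently of a GPU.

    ```python
    def double_buffered(batches, transfer, wait, compute):
        """Overlap the transfer of batch i+1 with the computation on batch i.

        transfer(slot, batch): start an async copy of `batch` into buffer `slot`
        wait(slot):            block until the copy into `slot` has finished
        compute(slot):         run the minibatch computation on buffer `slot`
        """
        batches = list(batches)
        results = []
        if not batches:
            return results
        transfer(0, batches[0])            # prime the pipeline
        for i in range(len(batches)):
            slot = i % 2
            if i + 1 < len(batches):
                # issue the next copy before waiting on the current one,
                # so it proceeds in the background during compute(slot)
                transfer(1 - slot, batches[i + 1])
            wait(slot)
            results.append(compute(slot))
        return results
    ```

    With cupy, the host-side batches would come from `pinned_array` as above, `transfer` would call `ndarray.set` on a non-blocking stream, and `wait` would synchronize that stream; only two device buffers and two streams are needed regardless of the number of minibatches.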