
Asynchronous GPU memory transfer with cupy


Is it possible to asynchronously transfer memory from/to GPU with cupy (or chainer)?

I'm training a relatively small network on very large data that does not fit into GPU memory. The data must be kept in CPU memory and fed to the GPU one minibatch at a time.

The memory transfer time is the dominant bottleneck of this application. I think asynchronous memory transfer would solve the problem, i.e. while one minibatch is being computed, the next minibatch is transferred to the GPU in the background.

I'm wondering whether this is possible with the cupy.cuda.Stream class, but I have no idea yet. I would appreciate any comments or advice.

EDIT: I thought the following code would perform the memory transfers asynchronously, but it does not.

import numpy as np
import cupy as cp

a_cpu = np.ones((10000, 10000), dtype=np.float32)
b_cpu = np.ones((10000, 10000), dtype=np.float32)

a_stream = cp.cuda.Stream(non_blocking=True)
b_stream = cp.cuda.Stream(non_blocking=True)

a_gpu = cp.empty_like(a_cpu)
b_gpu = cp.empty_like(b_cpu)

a_gpu.set(a_cpu, stream=a_stream)
b_gpu.set(b_cpu, stream=b_stream)

# This should start before b_gpu.set() is finished.
a_gpu *= 2

nvvp shows that the memory transfers take place sequentially.


Solution

  • I found a solution by diving into the chainer source code.

    The essential point seems to be to construct the np.ndarray on top of a pinned (page-locked) host memory buffer: CUDA can only overlap a host-to-device copy with other work when the host memory is pinned; transfers from ordinary pageable memory are effectively synchronous.

    import numpy as np
    import cupy as cp

    def pinned_array(array):
        # allocate pinned (page-locked) host memory and wrap it in an ndarray
        mem = cp.cuda.alloc_pinned_memory(array.nbytes)
        src = np.frombuffer(
                    mem, array.dtype, array.size).reshape(array.shape)
        src[...] = array
        return src
    
    a_cpu = np.ones((10000, 10000), dtype=np.float32)
    b_cpu = np.ones((10000, 10000), dtype=np.float32)
    # np.ndarray with pinned memory
    a_cpu = pinned_array(a_cpu)
    b_cpu = pinned_array(b_cpu)
    
    a_stream = cp.cuda.Stream(non_blocking=True)
    b_stream = cp.cuda.Stream(non_blocking=True)
    
    a_gpu = cp.empty_like(a_cpu)
    b_gpu = cp.empty_like(b_cpu)
    
    a_gpu.set(a_cpu, stream=a_stream)
    b_gpu.set(b_cpu, stream=b_stream)
    
    # wait until a_cpu has been copied into a_gpu
    a_stream.synchronize()
    # this line runs in parallel with b_gpu.set()
    a_gpu *= 2
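
    For the training loop described in the question, this extends naturally to a double-buffered pipeline: while the GPU computes on one buffer, the next minibatch is copied into the other. Below is a minimal sketch of that scheduling logic; `transfer`, `wait`, and `compute` are hypothetical callables standing in for `gpu_buf[slot].set(batch, stream=streams[slot])`, `streams[slot].synchronize()`, and the minibatch computation respectively, so the overlap structure can be shown (and tested) independently of a GPU.

    ```python
    def double_buffered(batches, transfer, wait, compute):
        """Overlap the transfer of batch i+1 with the computation on batch i.

        transfer(slot, batch): start an async copy of `batch` into buffer `slot`
        wait(slot):            block until the copy into `slot` has finished
        compute(slot):         run the minibatch computation on buffer `slot`
        """
        batches = list(batches)
        results = []
        if not batches:
            return results
        transfer(0, batches[0])            # prime the pipeline
        for i in range(len(batches)):
            slot = i % 2
            if i + 1 < len(batches):
                # issue the next copy before waiting on the current one,
                # so it proceeds in the background during compute(slot)
                transfer(1 - slot, batches[i + 1])
            wait(slot)
            results.append(compute(slot))
        return results
    ```

    With cupy, the host-side batches would come from `pinned_array` as above, `transfer` would call `ndarray.set` on a non-blocking stream, and `wait` would synchronize that stream; only two device buffers and two streams are needed regardless of the number of minibatches.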