
Cupy OutOfMemoryError when trying to cupy.load larger dimension .npy files in memory map mode, but np.load works fine


I'm trying to load some larger .npy files in cupy in memory-mapped mode, but I keep running into OutOfMemoryError.

I thought that because the file is opened in memory-mapped mode, this operation shouldn't take much memory, since a memory map doesn't actually load the whole array into memory.

I can load these files with np.load just fine; this only seems to happen with cupy.load. My environment is Google Colab, with a Tesla K80 GPU. It has about 12 GB of CPU RAM, 12 GB of GPU RAM, and 350 GB of disk space.

Here is a minimal example to reproduce the error:

import os
import numpy as np
import cupy

# Create the .npy files on disk.
for i in range(4):
    numpyMemmap = np.memmap('reg.memmap' + str(i), dtype='float32', mode='w+', shape=(10000000, 128))
    np.save('reg.memmap' + str(i), numpyMemmap)
    del numpyMemmap
    os.remove('reg.memmap' + str(i))

# Check that they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    NPYmemmap.append(np.load('reg.memmap' + str(i) + '.npy', mmap_mode='r+'))
del NPYmemmap

# Eventually results in a memory error.
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append(cupy.load('reg.memmap' + str(i) + '.npy', mmap_mode='r+'))

Output:

0
1
/usr/local/lib/python3.6/dist-packages/cupy/creation/from_data.py:41: UserWarning: Using synchronous transfer as pinned memory (5120000000 bytes) could not be allocated. This generally occurs because of insufficient host memory. The original error was: cudaErrorMemoryAllocation: out of memory
  return core.array(obj, dtype, copy, order, subok, ndmin)
2
3
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
<ipython-input-4-b5c849e2adba> in <module>()
      2 for i in range(4):
      3     print(i)
----> 4     CPYmemmap.append( cupy.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )

1 frames
/usr/local/lib/python3.6/dist-packages/cupy/io/npz.py in load(file, mmap_mode)
     47     obj = numpy.load(file, mmap_mode)
     48     if isinstance(obj, numpy.ndarray):
---> 49         return cupy.array(obj)
     50     elif isinstance(obj, numpy.lib.npyio.NpzFile):
     51         return NpzFile(obj)

/usr/local/lib/python3.6/dist-packages/cupy/creation/from_data.py in array(obj, dtype, copy, order, subok, ndmin)
     39 
     40     """
---> 41     return core.array(obj, dtype, copy, order, subok, ndmin)
     42 
     43 

cupy/core/core.pyx in cupy.core.core.array()

cupy/core/core.pyx in cupy.core.core.array()

cupy/core/core.pyx in cupy.core.core.ndarray.__init__()

cupy/cuda/memory.pyx in cupy.cuda.memory.alloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool.malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory.SingleDeviceMemoryPool._malloc()

cupy/cuda/memory.pyx in cupy.cuda.memory._try_malloc()

OutOfMemoryError: out of memory to allocate 5120000000 bytes (total 20480000000 bytes)

I am also wondering if this is perhaps related to Google Colab and its environment/GPU.

For convenience, here is a Google Colab notebook with this minimal code:

https://colab.research.google.com/drive/12uPL-ZnKhGTJifZGVdTN7e8qBRRus4tA


Solution

  • The numpy.load mechanism for a disk file, when memory-mapped, may not require the entire file to be loaded from disk into host memory; pages can be read from disk as the array is actually accessed.

    However, it appears that cupy.load requires the entire file to fit first in host memory, then in device memory (the traceback above shows it calling numpy.load and then copying the result with cupy.array).

    Your particular test case creates 4 disk files of ~5 GB each (10,000,000 x 128 float32 values = 5,120,000,000 bytes). These won't all fit in either host or device memory if you have 12 GB of each, so I would expect things to fail on the 3rd file load, if not earlier.

    It may be possible to use your numpy.load mechanism with mapped memory, and then selectively move portions of that data to the GPU with cupy operations, as sketched below. In that case, the data size on the GPU at any one time would still be limited to GPU RAM, for the usual things like cupy arrays.

    Even if you could use CUDA pinned "zero-copy" memory, it would still be limited to the host memory size (12 GB here) or less.
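
Here is a minimal sketch of that chunked approach, assuming the files created in the question; the file name and chunk size are illustrative choices, not requirements:

import numpy as np
import cupy

CHUNK_ROWS = 1000000  # ~0.5 GB per chunk: 1,000,000 rows x 128 cols x 4 bytes

# Memory-map one file on the host; this does not read the whole array from disk.
host_arr = np.load('reg.memmap0.npy', mmap_mode='r')

# Stream one bounded slice at a time to the device, use it, and release it
# before the next slice, so device memory stays well under the 12 GB limit.
for start in range(0, host_arr.shape[0], CHUNK_ROWS):
    chunk = cupy.asarray(host_arr[start:start + CHUNK_ROWS])  # host -> device copy
    # ... do whatever GPU work is needed on `chunk` here ...
    del chunk  # allow CuPy's memory pool to reuse this allocation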