Tags: python, numpy, memory, memory-management, numpy-memmap

NumPy memmap still uses RAM instead of disk during a vector operation


I initialize two operands and one result:

import numpy as np

N = 2 * 1024 * 1024 * 1024  # 2 Gi int64 elements = 16 GiB per array
a = np.memmap('a.mem', mode='w+', dtype=np.int64, shape=(N,))
b = np.memmap('b.mem', mode='w+', dtype=np.int64, shape=(N,))
result = np.memmap('result.mem', mode='w+', dtype=np.int64, shape=(N,))

In this idle state, the system RAM reported by Google Colab is still 1.0/12.7 GB, which is good: there is no RAM activity yet. But when I perform a vector operation such as vector subtraction, the reported system RAM climbs to a peak of 11.2/12.7 GB, close to the maximum, and the runtime kernel eventually crashes:

result[:] = a[:] - b[:]  # This still consumes memory
result.flush()

I have read the np.memmap docs many times. They state that the purpose of memmap is to reduce memory consumption, so why do I still get an out-of-memory error?

I suspect the vector subtraction must be buffered in small chunks, for example 512 MB at a time, but I don't know the syntax. Perhaps what I mean is something like this:

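BUFF_SIZE = 512 * 1024 * 1024 // 8  # elements per chunk: 512 MiB of 8-byte int64
for i in range(0, result.size, BUFF_SIZE):
    # Each slice expression still builds a temporary, but only one chunk in size
    result[i:i+BUFF_SIZE] = a[i:i+BUFF_SIZE] - b[i:i+BUFF_SIZE]
    result.flush()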

Solution

  • result[:] = a[:] - b[:] doesn't mean "write the subtraction results directly into result". It means "write the subtraction results into a new array, then copy the contents of that array into result". You're attempting to create a 16 GiB temporary array in the middle (2 Gi int64 elements at 8 bytes each), which cannot fit in 12.7 GB of RAM.

    To write the output directly into result, you can use the out parameter of the numpy.subtract ufunc:

    np.subtract(a, b, out=result)
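
The out parameter can also be combined with the chunking idea from the question. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper name subtract_chunked and its chunk_elems parameter are hypothetical, and the arrays are deliberately tiny so the example runs anywhere. It shows each slice of the difference being written straight into the matching slice of result, so no temporary array is allocated for the output. The operating system may still keep mapped pages in its cache, so this sketch does not verify what RAM figure Colab would report.

```python
import os
import tempfile

import numpy as np


def subtract_chunked(a, b, out, chunk_elems):
    """Compute out = a - b one slice at a time, writing directly into out.

    Hypothetical helper for illustration; chunk_elems is a count of
    elements, not bytes.
    """
    for start in range(0, out.size, chunk_elems):
        stop = min(start + chunk_elems, out.size)
        # out= makes the ufunc store its results in this slice of the
        # memmap itself instead of allocating a new array for them.
        np.subtract(a[start:stop], b[start:stop], out=out[start:stop])
    out.flush()


# Small demo size so the sketch runs quickly; the question uses 2**31 elements.
n = 1_000_000
tmpdir = tempfile.mkdtemp()
a = np.memmap(os.path.join(tmpdir, "a.mem"), mode="w+", dtype=np.int64, shape=(n,))
b = np.memmap(os.path.join(tmpdir, "b.mem"), mode="w+", dtype=np.int64, shape=(n,))
result = np.memmap(os.path.join(tmpdir, "result.mem"), mode="w+", dtype=np.int64, shape=(n,))

# Fill the operands with known values so the output can be checked.
a[:] = np.arange(n)
b[:] = 1

# 64 Ki elements per chunk is 512 KiB of int64 per operand slice.
subtract_chunked(a, b, result, chunk_elems=64 * 1024)
```

For the arrays in the question, a chunk size such as 64 * 1024 * 1024 elements (512 MiB of int64) would be one possible choice; the right value depends on how much RAM is available.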