Can operations on a numpy.memmap be deferred?


Consider this example:

import numpy as np
a = np.array(1)
np.save("a.npy", a)

a = np.load("a.npy", mmap_mode='r')
print(type(a))

b = a + 2
print(type(b))

which outputs

<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>

So it seems that b is no longer a memmap, and I assume that this forces numpy to read the whole of a.npy, defeating the purpose of the memmap. Hence my question: can operations on memmaps be deferred until access time?

I believe subclassing ndarray or memmap could work, but I don't feel confident enough in my Python skills to try it.

Here is an extended example showing my problem:

import numpy as np

# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))

# I want to print the first value using f and memmaps


def f(value):
    print(value[0])  # index 0 is the first value


# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)

# this is slow: b + 1 reads b completely and converts it into an in-memory array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)

Solution

  • Here's a simple example of an ndarray subclass that defers operations on it until a specific element is requested by indexing.
    I'm including this to show that it can be done, but it will almost certainly fail in novel and unexpected ways, and would require substantial work to make it usable. For a very specific case it may be easier than redesigning your code to solve the problem in a better way. I'd recommend reading over these examples from the docs to help understand how it works.

    import numpy as np


    class Defered(np.ndarray):
        """
        An array class that defers calculations applied to it, only
        calculating them when an index is requested.
        """

        def __new__(cls, arr):
            arr = np.asanyarray(arr).view(cls)
            arr.toApply = []  # pending (ufunc, method, inputs, kwargs) tuples
            return arr

        def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
            # Convert all array arguments to plain ndarray views; arguments
            # of type Defered would otherwise cause infinite recursion.
            # Store self as None, to be replaced by the indexed data later on.
            newinputs = []
            for i in inputs:
                if i is self:
                    newinputs.append(None)
                elif isinstance(i, np.ndarray):
                    newinputs.append(i.view(np.ndarray))
                else:
                    newinputs.append(i)

            # Record the operation on a new view rather than on self, so that
            # `b = a + 1` does not change what indexing `a` returns
            result = self.view(type(self))
            result.toApply = self.toApply + [(ufunc, method, newinputs, kwargs)]
            return result

        def __getitem__(self, idx):
            # Index first, so only the requested data is read as a regular array
            sub = self.view(np.ndarray)[idx]

            # Apply the stored operations to the indexed data only
            for ufunc, method, inputs, kwargs in self.toApply:
                inputs = [sub if i is None else i for i in inputs]
                sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)

            return sub
    

    This will fail if the array is modified by anything other than NumPy's universal functions (ufuncs). For instance, percentile and median aren't based on ufuncs, and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array, or that indexes a substantial portion of it, the entire array will be loaded.
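
    If subclassing turns out to be too fragile, one possible alternative is a thin wrapper that holds the memmap and applies functions only to the data being indexed. This is a minimal sketch, assuming purely elementwise operations; `LazyMap` and its `map` method are hypothetical names made up for illustration, not part of NumPy, and a small temporary file stands in for the large one from the question:

```python
import os
import tempfile

import numpy as np


class LazyMap:
    """Hypothetical wrapper (not a NumPy API): applies elementwise
    functions only to the data that is actually indexed."""

    def __init__(self, source, funcs=()):
        self.source = source        # e.g. a np.memmap; not copied here
        self.funcs = tuple(funcs)   # pending elementwise functions

    def map(self, func):
        # Return a new wrapper so the original stays unchanged
        return LazyMap(self.source, self.funcs + (func,))

    def __getitem__(self, idx):
        # Only the requested element or slice is read from the source
        sub = np.asarray(self.source[idx])
        for func in self.funcs:
            sub = func(sub)
        return sub


# Small demo file standing in for the 8 GB file from the question
path = os.path.join(tempfile.mkdtemp(), "demo.npy")
np.save(path, np.arange(10, dtype=np.float64))

data = np.load(path, mmap_mode="r")
lazy = LazyMap(data).map(lambda x: x + 1).map(np.sqrt)

first = lazy[0]    # computes sqrt(data[0] + 1) only
chunk = lazy[2:5]  # computes sqrt(data[2:5] + 1) only
print(first, chunk)
```

    Because the wrapper is not an ndarray, NumPy functions cannot silently consume it and load the whole file; the trade-off is that every operation has to be passed in explicitly through `map`.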