
Is it possible to np.concatenate memory-mapped files?


I saved a couple of numpy arrays with np.save(), and put together they're quite huge.

Is it possible to load them all as memory-mapped files, and then concatenate and slice through all of them without ever loading anything into memory?


Solution

  • Using numpy.concatenate loads the arrays into memory, because it builds the result as a new in-memory array. To avoid this, you can create a third memmap array backed by a new file and copy into it the values from the arrays you wish to concatenate. More efficiently, you can append the new arrays to a file that already exists on disk.

    In either case you must choose the right memory order for the array (row-major or column-major).
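Since the question is about arrays written with np.save(), the first approach (a third array in a new file) might be sketched as follows. This is a minimal sketch, assuming the inputs are .npy files that share a dtype and all dimensions except the first. The helper name concat_npy_axis0, the file paths, and the chunk size are hypothetical; np.load(..., mmap_mode='r') and np.lib.format.open_memmap are existing NumPy calls.

```python
import numpy as np

def concat_npy_axis0(in_paths, out_path, chunk_rows=1024):
    """Concatenate .npy files along axis 0 into a new .npy file (hypothetical helper)."""
    # mmap_mode='r' maps each file read-only instead of reading it into RAM.
    sources = [np.load(p, mmap_mode='r') for p in in_paths]
    total_rows = sum(s.shape[0] for s in sources)
    out_shape = (total_rows,) + sources[0].shape[1:]

    # open_memmap creates a .npy file (header included) backed by a memmap.
    out = np.lib.format.open_memmap(
        out_path, mode='w+', dtype=sources[0].dtype, shape=out_shape)

    row = 0
    for src in sources:
        # Copy in bounded chunks so each assignment touches at most chunk_rows rows.
        for start in range(0, src.shape[0], chunk_rows):
            stop = min(start + chunk_rows, src.shape[0])
            out[row:row + (stop - start)] = src[start:stop]
            row += stop - start
    out.flush()
    return out
```

The returned array is itself a memmap, so it can be sliced without loading the whole result, and the output file can later be reopened with np.load(out_path, mmap_mode='r').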

    The following examples illustrate how to concatenate along axis 0 and axis 1.


    1) concatenate along axis=0

    import numpy as np

    a = np.memmap('a.array', dtype='float64', mode='w+', shape=( 5000,1000)) # 38.1MB
    a[:,:] = 111
    b = np.memmap('b.array', dtype='float64', mode='w+', shape=(15000,1000)) # 114 MB
    b[:,:] = 222
    

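    You can define a third array that reads the same file as the first array to be concatenated (here a) in mode r+ (open an existing file for reading and writing), but with the shape of the final array you want after concatenation. Because that shape needs more bytes than the file currently holds, the file is extended to fit: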

    c = np.memmap('a.array', dtype='float64', mode='r+', shape=(20000,1000), order='C')
    c[5000:,:] = b
    

    Concatenating along axis=0 does not require passing order='C', because row-major is already the default order.
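A scaled-down version of the axis=0 example can be used to check the result. This is only a sketch: the temporary directory and the small shapes are illustrative choices, and it assumes a NumPy version in which opening a file with mode='r+' and a larger shape extends the file.

```python
import os
import tempfile
import numpy as np

tmpdir = tempfile.mkdtemp()          # illustrative scratch location
path_a = os.path.join(tmpdir, 'a.array')
path_b = os.path.join(tmpdir, 'b.array')

# Small stand-ins for the (5000, 1000) and (15000, 1000) arrays.
a = np.memmap(path_a, dtype='float64', mode='w+', shape=(5, 4))
a[:, :] = np.arange(20).reshape(5, 4)
a.flush()
del a                                # drop the reference before the file grows
b = np.memmap(path_b, dtype='float64', mode='w+', shape=(15, 4))
b[:, :] = 222

# Reopen a.array with the final shape and fill the new rows from b.
c = np.memmap(path_a, dtype='float64', mode='r+', shape=(20, 4))
c[5:, :] = b
c.flush()

# The first 5 rows keep the original contents of a, the rest come from b.
print(np.array_equal(c[:5], np.arange(20).reshape(5, 4)))
print(bool((c[5:] == 222).all()))
print(os.path.getsize(path_a))       # file size in bytes after the extension
```

Using distinct values in a, rather than a constant fill, makes it visible that the existing rows are left untouched.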


    2) concatenate along axis=1

    a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,3000), order='F') # 114 MB, stored column-major
    a[:,:] = 111
    b = np.memmap('b.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
    b[:,:] = 222
    

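    The arrays are stored on disk as flat sequences of values, so if you create c with mode='r+' and shape=(5000,4000) without changing the array order, the first 1000 elements of the second row of a end up at the end of the first row of c. You can avoid this by passing order='F' (column-major) to memmap, so that the appended values become new columns. This preserves the contents of a only if a's data is itself laid out column-major in the file (a constant fill, as here, looks the same in either order):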

    c = np.memmap('a.array', dtype='float64', mode='r+',shape=(5000,4000), order='F')
    c[:, 3000:] = b
    

    The file 'a.array' now holds the concatenation result. You can repeat this process to concatenate further arrays, one pair at a time.
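The effect of the storage order on the axis=1 case can be checked on a tiny array with distinct values. This is a minimal sketch, assuming that a is written in column-major order (which the approach needs whenever a is not constant); the temporary path and the 2x3 shape are illustrative only.

```python
import os
import tempfile
import numpy as np

tmpdir = tempfile.mkdtemp()          # illustrative scratch location
path_a = os.path.join(tmpdir, 'a.array')

a_vals = np.arange(6, dtype='float64').reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
b_vals = np.array([[10.0], [20.0]])                    # one extra column

# Store a column-major so that bytes appended to the file become new columns.
a = np.memmap(path_a, dtype='float64', mode='w+', shape=(2, 3), order='F')
a[:, :] = a_vals
a.flush()
del a                                # drop the reference before the file grows

c = np.memmap(path_a, dtype='float64', mode='r+', shape=(2, 4), order='F')
c[:, 3:] = b_vals
c.flush()
print(np.array_equal(c, np.hstack([a_vals, b_vals])))

# Interpreting the same bytes as row-major gives a different array,
# which is why the order has to match how the file was written.
wrong = np.memmap(path_a, dtype='float64', mode='r', shape=(2, 4), order='C')
print(np.array_equal(wrong, np.hstack([a_vals, b_vals])))
```

The first comparison should hold and the second should not, since the flat file contents are the same in both cases and only the interpretation differs.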
