Tags: python, arrays, numpy, hdf5, numpy-memmap

Random access in a saved-on-disk numpy array


I have one big numpy array A of shape (2_000_000, 2000) of dtype float64, which takes 32 GB.

(Alternatively, the same data split into 10 arrays of shape (200_000, 2000), if that makes serialization easier.)

How can we serialize it to disk such that we can have fast random read access to any part of the data?

More precisely, I need to be able to read tens of thousands of windows of shape (16, 2000) from A at random starting indexes i:

import random
import numpy as np

L = []
for _ in range(10_000):
    i = random.randint(0, 2_000_000 - 16)
    window = A[i:i+16, :]    # window of A of shape (16, 2000) starting at random index i
    L.append(window)
WINS = np.stack(L)           # shape (10_000, 16, 2000) of float64, i.e. ~2.4 GB

Let's say I only have 8 GB of RAM available for this task; it's totally impossible to load the whole 32 GB of A into RAM.

How can we read such windows from a numpy array serialized on disk (.h5 or any other format)?

Note: the fact that the reading is done at randomized starting indexes is important.


Solution

  • This example shows how you can use an HDF5 file for the process you describe.

    First, create an HDF5 file with a dataset of shape (2_000_000, 2000) and dtype float64. (The dtype is set explicitly because h5py defaults to float32.) I used variables for the dimensions so you can tinker with them.

    import numpy as np
    import h5py
    import random
    
    h5_a0, h5_a1 = 2_000_000, 2_000
    
    with h5py.File('SO_68206763.h5', 'w') as h5f:
        # dtype must be given explicitly: h5py defaults to float32
        dset = h5f.create_dataset('test', shape=(h5_a0, h5_a1), dtype='float64')
        
        # write the data in 1_000 increments of 2_000 rows each
        # to keep the in-memory working set small
        incr = 1_000
        a0 = h5_a0 // incr
        for i in range(incr):
            arr = np.random.random((a0, h5_a1))
            dset[i*a0:(i+1)*a0, :] = arr
        print(dset[-1, 0:10])  # quick check of values in the last row
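
    Since the windows start at arbitrary row offsets, the dataset's chunk layout can affect read performance. As a possible variant (the chunk shape here is an assumption you would want to benchmark, not part of the original answer), you can chunk by blocks of full rows:

    with h5py.File('SO_68206763.h5', 'w') as h5f:
        # chunks of 16 full rows (~256 KB per chunk with float64), so one random
        # window touches at most two chunks; benchmark before adopting this
        dset = h5f.create_dataset('test', shape=(h5_a0, h5_a1),
                                  dtype='float64', chunks=(16, h5_a1))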
    

    Next, open the file in read mode, read 10_000 random array slices of shape (16, 2_000), and append them to the list L. At the end, convert the list to the array WINS. Note that by default np.concatenate returns an array with 2 axes -- use .reshape() if you want 3 axes as in your question (the reshape is also shown).

    with h5py.File('SO_68206763.h5', 'r') as h5f:
        dset = h5f['test']
        L = []
        ds0, ds1 = dset.shape[0], dset.shape[1]
        for i in range(10_000):
            ir = random.randint(0, ds0 - 16)
            window = dset[ir:ir+16, :]  # window of shape (16, 2000) starting at random index ir
            L.append(window)
        WINS = np.concatenate(L)   # shape (160_000, 2_000) of float64
        print(WINS.shape, WINS.dtype)
        WINS = np.concatenate(L).reshape(10_000, 16, ds1)   # reshaped to (10_000, 16, 2_000)
        print(WINS.shape, WINS.dtype)
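
    As an aside, np.stack builds the 3-axis array in a single step, so the reshape above is not needed (a minor variant, not part of the original answer):

    WINS = np.stack(L)   # shape (10_000, 16, 2_000) of float64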
    

    The procedure above is not memory efficient: you end up with two copies of the randomly sliced data, one in the list L and one in the array WINS. If memory is limited, this could be a problem. To avoid the intermediate copy, read each random slice of data directly into a preallocated array. Doing this simplifies the code and reduces the memory footprint. That method is shown below (WINS2 is a 2-axis array, and WINS3 is a 3-axis array).

    with h5py.File('SO_68206763.h5', 'r') as h5f:
        dset = h5f['test']
        ds0, ds1 = dset.shape[0], dset.shape[1]
        WINS2 = np.empty((10_000*16, ds1))
        WINS3 = np.empty((10_000, 16, ds1))
        for i in range(10_000):
            ir = random.randint(0, ds0 - 16)
            window = dset[ir:ir+16, :]        # read each window from disk only once
            WINS2[i*16:(i+1)*16, :] = window
            WINS3[i, :, :] = window
        print(WINS2.shape, WINS2.dtype)
        print(WINS3.shape, WINS3.dtype)
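
    Finally, since the question allows any on-disk format, here is a minimal sketch of the same access pattern with np.memmap (one of the question's tags). It assumes the array was first saved as raw float64 bytes, e.g. with A.tofile('A.bin') -- the filename is illustrative:

    import numpy as np
    import random

    # memory-map the raw file; nothing is loaded until it is sliced
    A = np.memmap('A.bin', dtype='float64', mode='r', shape=(2_000_000, 2_000))

    WINS = np.empty((10_000, 16, 2_000))
    for i in range(10_000):
        ir = random.randint(0, A.shape[0] - 16)
        WINS[i, :, :] = A[ir:ir+16, :]   # copies only this 16-row window into RAM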