Tags: python, shared-memory, hdf5, h5py, hdf

Can you read an HDF5 dataset directly into SharedMemory with Python?


I need to share a large dataset from an HDF5 file between multiple processes and, for a set of reasons, mmap is not an option.

So I read it into a NumPy array and then copy that array into shared memory, like this:

import numpy as np
import h5py
from multiprocessing import shared_memory

dataset = h5py.File(args.input)['data']
shm = shared_memory.SharedMemory(
    name=memory_label,
    create=True,
    size=dataset.nbytes
)
shared_tracemap = np.ndarray(dataset.shape, buffer=shm.buf)
shared_tracemap[:] = dataset[:]  # dataset[:] materializes a temporary array, which is then copied

But this approach doubles the amount of required memory, because I need to use a temporary variable. Is there a way to read the dataset directly into SharedMemory?


Solution

  • First, an observation: in your code, dataset is an h5py dataset object, not a NumPy array. It does not load the entire dataset into memory!

    As @Monday commented, read_direct() reads directly from an HDF5 dataset into a NumPy array. Use it to avoid making an intermediate copy when slicing.

    Here is how to add it to your code. (Note: I suggest including the dtype keyword in your np.ndarray() call.)

    shared_tracemap = np.ndarray(dataset.shape, dtype=dataset.dtype, buffer=shm.buf)
    dataset.read_direct(shared_tracemap)
    

    You can use the source_sel= and dest_sel= keywords to read a slice of the dataset. Example:

    dataset.read_direct(shared_tracemap, source_sel=np.s_[0:100], dest_sel=np.s_[0:100])
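
    For completeness, here is a minimal sketch of how a consumer process could attach to the same block and read the data without copying. The block name, shape, and dtype below are placeholders: the consumer has to learn them out of band (e.g. passed as arguments), since they are not stored in the shared-memory block itself.

    import numpy as np
    from multiprocessing import shared_memory

    # Attach to the block created by the reader process ("tracemap" is a
    # placeholder name; use whatever you passed as memory_label).
    shm = shared_memory.SharedMemory(name="tracemap")

    # Wrap the buffer in an array view; shape and dtype must match the
    # original dataset exactly, otherwise the data will be misinterpreted.
    view = np.ndarray((1000, 1000), dtype=np.float64, buffer=shm.buf)
    print(view[0, :10])  # reads straight from shared memory, no copy

    shm.close()  # each process closes its own handle

    # The process that created the block should also call shm.unlink()
    # once every consumer is finished, to release the memory to the OS.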