
How to automatically resize an HDF5 dataset with h5py?


Is there a way for an HDF5 dataset to start small in size and automatically be resized to fit more and more items as they are appended into it?

I know that using h5py I can start small and make the dataset "unlimited" in size like so:

dset = file.create_dataset("my_dataset", (1024,), maxshape=(None,))

But AFAIK I still have to resize() the dataset as it nears its current capacity (1024 initially in the above example).

Is there a way for me not to have to resize() explicitly in my code?


Solution

  • Short answer: No.
    I'm not an expert on the underlying HDF5 library, but I don't think it has this capability (and h5py is simply a wrapper around it). The (sort of) good news: h5py will raise an exception if you try to write beyond the allocated size. The code below expands on your example to demonstrate.

    import h5py
    import numpy as np

    with h5py.File('SO_68389770.h5', 'w') as h5f:
        dset = h5f.create_dataset("my_dataset", (1024,), maxshape=(None,))
        size = 100
        for i in range(10):
            arr = np.random.random(size)
            start, end = i*size, i*size + size
            dset[start:end] = arr


    This works with range(10), because 10 writes of 100 values fit within the initial allocation of 1024. With range(11), the final slice dset[1000:1100] runs past the allocated 1024 elements, so only 24 slots remain and you get this error:
    TypeError: Can't broadcast (100,) -> (24,)

    The code below handles any number of writes cleanly by checking dset.shape[0] before each write and calling resize() when needed.

    import h5py
    import numpy as np

    with h5py.File('SO_68389770.h5', 'w') as h5f:
        dset = h5f.create_dataset("my_dataset", (1024,), maxshape=(None,))
        size = 100
        for i in range(13):
            arr = np.random.random(size)
            start, end = i*size, i*size + size
            if dset.shape[0] >= end:
                dset[start:end] = arr
            else:
                print(f'insufficient dset size, end={end}; resizing')
                dset.resize(end, axis=0)  # grow along the unlimited axis
                dset[start:end] = arr
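
    If you do this in more than one place, the check-and-resize pattern above can be wrapped in a small helper so callers can "append" without tracking capacity themselves. The function name append_to_dset is my own invention, not part of h5py's API; this is just a sketch of one way to package the pattern.

    ```python
    import h5py
    import numpy as np

    def append_to_dset(dset, arr, start):
        """Write arr at dset[start:start+len(arr)], growing the dataset if needed.

        Assumes dset was created with maxshape=(None,) so axis 0 is resizable.
        Returns the next write position.
        """
        end = start + len(arr)
        if dset.shape[0] < end:
            dset.resize(end, axis=0)  # grow along the unlimited axis
        dset[start:end] = arr
        return end

    with h5py.File('SO_68389770_append.h5', 'w') as h5f:
        dset = h5f.create_dataset("my_dataset", (1024,), maxshape=(None,))
        pos = 0
        for _ in range(13):
            pos = append_to_dset(dset, np.random.random(100), pos)
        final_size = dset.shape[0]
    ```

    After the loop, final_size is 1300: the dataset grew past its initial 1024 elements on demand, which is as close to "automatic" resizing as h5py gets.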