Tags: python, io, hdf5, h5py

How does the writing process work in h5py Datasets?


I am using the following syntax to overwrite part of an hdf5 file in Python:

import h5py

f = h5py.File(file_path, 'r+')  # 'r+' opens the existing file for reading and writing
dset = f["mykey"]
dset[:3] = [1,2,3]
f.close()

It seems to be working, but I could not find information in the documentation about how this update is made. I am wondering whether the dataset is (1) loaded into memory, (2) updated, and (3) written back in its entirety, or whether only the affected piece of data is updated on disk.

I am asking because I want to reimplement this for .npy files, and I have the choice between loading the data, updating it, and rewriting the whole file, or seeking to the right offset and making only the necessary update on disk.
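
For the seek-style option on .npy files, I imagine something like the following sketch using NumPy's memory mapping (the file name "data.npy" is just a placeholder):

import numpy as np

# Memory-map the existing .npy file for reading and writing; slice assignment
# then touches only the affected bytes on disk instead of rewriting the array.
arr = np.load("data.npy", mmap_mode="r+")
arr[:3] = [1, 2, 3]   # updates just these elements in the mapped file
arr.flush()           # make sure the change reaches disk
del arr               # release the memory map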


Solution

  • So have you studied the h5py docs, especially the page about datasets? It's all there.

    Here's what I've deduced from reading those docs and answering a variety of SO questions.

    f = h5py.File(file_path, 'r+')
    dset = f["mykey"]
    

    dset is the dataset object; the data it refers to stays on disk in the file.

    arr = dset[:]
    

    would load the entire dataset into a NumPy array in memory.

    dset[:3] = [1,2,3]
    

    This, on the other hand, writes np.array([1, 2, 3]) to the dataset in the file; that is, it modifies only the first 3 elements on disk.

    f.close()
    

    Due to buffering, that write might not actually reach disk until the file is flushed or closed.
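
    If you want to make that explicit, here's a hedged sketch (same file_path and "mykey" as above): h5py.File works as a context manager, and f.flush() asks HDF5 to push buffered writes to disk.

    import h5py

    # The with block guarantees the file is flushed and closed even if an
    # exception is raised; flush() can also be called explicitly earlier.
    with h5py.File(file_path, 'r+') as f:
        dset = f["mykey"]
        dset[:3] = [1, 2, 3]
        f.flush()   # push buffered writes to disk right now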

    Since it is possible to load just a portion of the dataset

    arr = dset[:3]
    

    I deduce it can also perform the write without loading the whole dataset. The actual implementation is a mix of Python and C (the HDF5 C library), with Cython as the bridge.
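
    To see the partial I/O in action, here's a small experiment (the file name "demo.h5" is made up) that creates a dataset far larger than the slice being touched; only the selected region travels between memory and disk:

    import h5py

    with h5py.File("demo.h5", "w") as f:
        # ~1 GiB dataset; no 1 GiB array is ever built in Python memory
        dset = f.create_dataset("mykey", shape=(2**27,), dtype="float64")
        dset[:3] = [1, 2, 3]        # writes only the first three elements

    with h5py.File("demo.h5", "r") as f:
        print(f["mykey"][:3])       # reads only that slice -> [1. 2. 3.]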