Tags: python, hdf5, h5py

Creating HDF5 virtual dataset for dynamic data using h5py


I have an HDF5 file which contains three 1D arrays in different datasets. The file is created using h5py in Python, and the 1D arrays are continually being appended to (i.e. growing). For simplicity, let's call these 1D arrays "A", "B" and "C", and say each array initially contains 100 values but grows by one value every second (e.g. 101, 102, etc.).

What I'm looking to do is create a single virtual dataset which is the concatenation of all three 1D arrays. This is relatively easy for the static case (3 x 100 values), but I want the virtual dataset to grow as more values are added (e.g. 303 values at 1 second, 306 at 2 seconds, etc.).

Is there a Pythonic / efficient way to do this that doesn't just delete the virtual dataset and recreate it each second?


Solution

  • You don't have to delete the virtual dataset and recreate it when you add data. Instead, use resizable source datasets and a resizable VDS VirtualLayout (i.e. created with the maxshape= parameter), and use the h5py.h5s.UNLIMITED value to create an unlimited selection along the growing axis of both the data source and the VDS layout. Both features are described in the h5py docs.

    The solution posted below will accomplish this task.

    However, a word of warning before you implement it: HDF5/h5py I/O performance degrades when you write a lot of small data blocks, so your example may be painfully slow. It's better to occasionally append large blocks of data than to frequently append small ones. (e.g. It's better to add 60*60 values every hour than to add 1 value every second.) A small buffering sketch along these lines follows the solution code below.

    Here's a solution that creates 3 resizable datasets and a resizable VirtualLayout. The UNLIMITED value is used in the slice definitions that map each VirtualSource into the VirtualLayout.

    import h5py
    import numpy as np

    num_dsets = 3
    a0 = 100            # initial number of values in each dataset
    UNLIMITED = h5py.h5s.UNLIMITED
    
    with h5py.File("SO_78415089.h5", "w") as h5f:
    
        dset_names = [ f'dset_{i:02d}' for i in range(num_dsets)]
        # Create virtual layout
        vds_layout = h5py.VirtualLayout(shape=(num_dsets,a0), maxshape=(num_dsets,None), dtype="int")
    
        for i, dset_name in enumerate(dset_names):
            # create data and load to dataset
            arr_data = np.arange(a0*i,a0*(i+1))
            h5f.create_dataset(dset_name,data=arr_data,maxshape=(None,))
       
            # Create virtual source and map to layout
            vsource = h5py.VirtualSource(h5f[dset_name])
            vds_layout[i, :UNLIMITED] = vsource[:UNLIMITED]
            
        # Create the virtual dataset from the layout
        h5f.create_virtual_dataset("vdata", vds_layout, fillvalue=-1)
     
        # Resize each source dataset and append more values;
        # the unlimited VDS mapping exposes the new data automatically
        for i, dset_name in enumerate(dset_names):
            c0 = h5f[dset_name].shape[0]
            h5f[dset_name].resize((c0+a0,))
            arr_data = np.arange(c0+a0*i,c0+a0*(i+1))
            h5f[dset_name][c0:c0+a0] = arr_data
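
    To check that the virtual dataset tracks the growing sources, you can reopen the file and read "vdata". This read-back snippet is an illustrative addition, not part of the original answer; with the default VDS view (H5D_VDS_LAST_AVAILABLE) the reported extent should cover the appended data.

    import h5py

    with h5py.File("SO_78415089.h5", "r") as h5f:
        vdata = h5f["vdata"]
        # Each source was resized from 100 to 200 values above,
        # so the VDS should now report shape (3, 200)
        print(vdata.shape)
        print(vdata[0, :5])    # start of dset_00 -> [0 1 2 3 4]
        print(vdata[2, -5:])   # last values appended to dset_02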
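
    As for the performance warning above: rather than writing one value per second, buffer samples in memory and append them in larger blocks. A minimal sketch, assuming a hypothetical append_sample() helper and a 3600-sample flush interval (neither is part of the original answer):

    import h5py
    import numpy as np

    FLUSH_SIZE = 3600   # e.g. one hour of 1 Hz samples (assumed interval)
    pending = []        # in-memory buffer of values not yet written

    def append_sample(value, filename="SO_78415089.h5", dset_name="dset_00"):
        """Collect samples in memory and flush them to HDF5 in one large block."""
        pending.append(value)
        if len(pending) >= FLUSH_SIZE:
            with h5py.File(filename, "a") as h5f:
                dset = h5f[dset_name]
                n = dset.shape[0]
                dset.resize((n + len(pending),))   # grow the resizable dataset
                dset[n:] = np.asarray(pending)     # one large write instead of many small ones
            pending.clear()

    The virtual dataset created above needs no maintenance here: the unlimited mapping exposes the new values the next time "vdata" is read.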