python hdf5 h5py pytables

Saving mixed structured data with h5py


I have a dataset with 100,000 entries, each of the form:

{
attr1 float[300]
attr2 float[300]
attr3 float[300]
attr4 float
attr5 float
attr6 float
}

What is the most efficient way to store this in an .hdf5 file?


Solution

  • Without your data (and its structure) or a code example, it's hard to give an answer specific to your problem, so I created a PyTables example that shows the basic operations. There are many ways to define the table structure and load the data. I like to create a np.dtype and pass it with description=. In this example, I add the data row by row, appending a list that holds one tuple per row. However, if you already have all the data, you can create a NumPy structured array and pass it with the obj= parameter. This creates the table and populates it in one shot.

    Here is a PyTables example with 100 rows and the attr1/2/3 arrays sized to 10 elements. It shows the logic; you can modify it to increase the number of rows and array elements.

    All of the PyTables Table methods are explained in the PyTables documentation (Table methods).

    import tables as tb
    import numpy as np
    
    attr1  = np.arange(10.)
    attr2  = 2.0*np.arange(10.)
    attr3  = 3.0*np.arange(10.)
    attr4  = 4.0
    attr5  = 5.0
    attr6  = 6.0
    
    ds_dt = np.dtype({'names':['attr1', 'attr2', 'attr3',
                               'attr4', 'attr5', 'attr6'],
                      'formats':[(float,10), (float,10), (float,10),
                                  float, float, float ] }) 
    
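    # tb.open_file() is the documented way to open or create a PyTables file
    with tb.open_file('SO_58674120_tb.h5', 'w') as h5f:

        tb1 = h5f.create_table('/', 'my_ds', description=ds_dt)
        # range(1, 101) appends 100 rows, scaled by 1..100
        for rcnt in range(1, 101):
            # one tuple per row, wrapped in a list for append()
            data_list = [(rcnt*attr1, rcnt*attr2, rcnt*attr3,
                          rcnt*attr4, rcnt*attr5, rcnt*attr6)]
            tb1.append(data_list)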
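
The "all in one shot" route mentioned above can be sketched as follows. This is a minimal example, assuming the whole dataset fits in memory; the placeholder values, the names N_ROWS, N_ELEM and all_rows, and the file name bulk_example_tb.h5 are made up for illustration and are not part of the original answer.

```python
import os
import tempfile

import numpy as np
import tables as tb

N_ROWS, N_ELEM = 100, 10   # small demo sizes (assumption); scale up as needed

ds_dt = np.dtype([('attr1', float, (N_ELEM,)),
                  ('attr2', float, (N_ELEM,)),
                  ('attr3', float, (N_ELEM,)),
                  ('attr4', float),
                  ('attr5', float),
                  ('attr6', float)])

# Build every row in memory first (placeholder values, same pattern as the loop).
scale = np.arange(1, N_ROWS + 1, dtype=float)   # row multipliers 1.0 .. 100.0
base = np.arange(N_ELEM, dtype=float)           # 0.0 .. 9.0
all_rows = np.empty(N_ROWS, dtype=ds_dt)
all_rows['attr1'] = scale[:, None] * base
all_rows['attr2'] = 2.0 * scale[:, None] * base
all_rows['attr3'] = 3.0 * scale[:, None] * base
all_rows['attr4'] = 4.0 * scale
all_rows['attr5'] = 5.0 * scale
all_rows['attr6'] = 6.0 * scale

# Hypothetical file name, written to the system temp directory.
h5_path = os.path.join(tempfile.gettempdir(), 'bulk_example_tb.h5')

with tb.open_file(h5_path, 'w') as h5f:
    # obj= takes both the table description and the data from the array.
    table = h5f.create_table('/', 'my_ds', obj=all_rows)
    n_written = table.nrows
    attr4_col = table.col('attr4')   # read one column back as a NumPy array
```

Because the description comes from the array's dtype, no separate description= argument is needed here. For the sizes in the question, the same pattern would apply with N_ROWS = 100000 and N_ELEM = 300, provided the array fits in memory.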
    

    You can do the same with h5py. The process is similar, with a few differences. For example, you have to size the dataset with shape=, and add maxshape= if you want to extend the dataset later. Also, I only know how to add data from NumPy arrays (not from lists, as in PyTables), so I created recarr, a one-row structured array, to hold the intermediate data. Again, if you already have all your data, you don't have to load it row by row.

    See code below:

    import h5py
    import numpy as np
    
    attr1  = np.arange(10.)
    attr2  = 2.0*np.arange(10.)
    attr3  = 3.0*np.arange(10.)
    attr4  = 4.0
    attr5  = 5.0
    attr6  = 6.0
    
    ds_dt = np.dtype({'names':['attr1', 'attr2', 'attr3',
                               'attr4', 'attr5', 'attr6'],
                      'formats':[(float,10), (float,10), (float,10),
                                  float, float, float ] }) 
    recarr = np.empty((1,), dtype=ds_dt)
    
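    with h5py.File('SO_58674120_h5.h5', 'w') as h5f:

        # maxshape must be a tuple: (None,) makes the first axis resizable,
        # whereas (None) is just None and leaves the dataset fixed-size
        ds = h5f.create_dataset('my_ds', dtype=ds_dt, shape=(100,), maxshape=(None,))
        # rows are 0-indexed, so fill indices 0..99 and scale by 1..100
        for row in range(100):
            rcnt = row + 1
            recarr['attr1'] = rcnt*attr1
            recarr['attr2'] = rcnt*attr2
            recarr['attr3'] = rcnt*attr3
            recarr['attr4'] = rcnt*attr4
            recarr['attr5'] = rcnt*attr5
            recarr['attr6'] = rcnt*attr6
            ds[row] = recarr[0]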
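
If all the data is already in memory, the row-by-row loop can be skipped in h5py as well. The sketch below is one way to do that, and it also shows how maxshape= lets the dataset grow later. The placeholder values, the names N_ROWS, N_ELEM and all_rows, and the file name bulk_example_h5.h5 are assumptions for illustration, not part of the original answer.

```python
import os
import tempfile

import h5py
import numpy as np

N_ROWS, N_ELEM = 100, 10   # small demo sizes (assumption); scale up as needed

ds_dt = np.dtype([('attr1', float, (N_ELEM,)),
                  ('attr2', float, (N_ELEM,)),
                  ('attr3', float, (N_ELEM,)),
                  ('attr4', float),
                  ('attr5', float),
                  ('attr6', float)])

# Build every row in memory first (placeholder values).
scale = np.arange(1, N_ROWS + 1, dtype=float)   # row multipliers 1.0 .. 100.0
base = np.arange(N_ELEM, dtype=float)           # 0.0 .. 9.0
all_rows = np.empty(N_ROWS, dtype=ds_dt)
all_rows['attr1'] = scale[:, None] * base
all_rows['attr2'] = 2.0 * scale[:, None] * base
all_rows['attr3'] = 3.0 * scale[:, None] * base
all_rows['attr4'] = 4.0 * scale
all_rows['attr5'] = 5.0 * scale
all_rows['attr6'] = 6.0 * scale

# Hypothetical file name, written to the system temp directory.
h5_path = os.path.join(tempfile.gettempdir(), 'bulk_example_h5.h5')

with h5py.File(h5_path, 'w') as h5f:
    # data= sets the shape and dtype from the array; maxshape=(None,)
    # keeps the first axis resizable (this requires chunked storage,
    # which h5py enables automatically when maxshape is given).
    ds = h5f.create_dataset('my_ds', data=all_rows, maxshape=(None,))
    shape_after_create = ds.shape

    # Grow the dataset and write a second batch into the new rows.
    ds.resize((2 * N_ROWS,))
    ds[N_ROWS:2 * N_ROWS] = all_rows
    shape_after_resize = ds.shape

    # Read everything back, then pick one field from the structured array.
    attr4_back = ds[:]['attr4']
```

Writing in large batches like this generally means far fewer write calls than assigning one record at a time, which matters at the 100,000-row scale in the question.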