
How to compress lists/nested lists in HDF5


I recently learned about HDF5 compression and have started working with it. It has some advantages over .npz/.npy when working with very large files. I tried it out on a small list, since I sometimes work with lists of strings such as the following:

def write():
    test_array = ['a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2']
    

    with  h5py.File('example_file.h5', 'w') as f:
        f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9) 
        f.close()
    

However I got this error:

f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1689, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1508, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)

After hours of searching online, I couldn't find a better way to do this. Is there a better way to compress lists with HDF5?


Solution

  • This is a more general answer for nested lists where each nested list has a different length. It also works for the simpler case where the nested lists are all the same length. There are two solutions: one with h5py and one with PyTables.
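
For the flat list in the question, no padding is needed. The original call most likely failed because `repr(test_array)` produces a single Python string and `dtype='S'` carries no length, so h5py ends up asking for a zero-size string type. Converting the list to a NumPy array of fixed-length byte strings first avoids both problems. The following is a minimal sketch that assumes ASCII-only items; the file name `example_flat.h5`, the temporary directory, and the variable names are illustrative and not from the original post.

```python
import os
import tempfile

import h5py
import numpy as np

# Same 24 strings as the list in the question.
test_list = ['a1', 'a2'] * 12

# Fixed-length byte strings, sized to the longest item in the list.
slen = max(len(item) for item in test_list)
flat_array = np.array(test_list, dtype='S' + str(slen))

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example_flat.h5')

# Write with gzip compression at the highest level.
with h5py.File(path, 'w') as f:
    f.create_dataset('test3', data=flat_array,
                     compression='gzip', compression_opts=9)

# Read back and decode the byte strings to ordinary str objects.
with h5py.File(path, 'r') as f:
    restored = [item.decode('ascii') for item in f['test3'][:]]
```

The `with` block closes the file on exit, so no explicit `f.close()` is required.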

    h5py example
    A regular h5py dataset is rectangular and cannot hold rows of different lengths, so you have to create a dataset sized to the longest sublist and pad the "short" sublists. The code below pads with None, which NumPy converts to the string 'None' and truncates to the dtype width, so you will get 'None' (or a prefix of it) at each array position that doesn't have a corresponding value in the nested list. Take care with the dtype= entry: the code finds the length of the longest string in the list (as slen=##) and uses it to create dtype='S##'.

    import h5py
    import numpy as np
    
    test_list = [['a01','a02','a03','a04','a05','a06'], 
                 ['a11','a12','a13','a14','a15','a16','a17'], 
                 ['a21','a22','a23','a24','a25','a26','a27','a28']]
    
    # arrlen and test_array from answer to SO #10346336 - Option 3:
    # Ref: https://stackoverflow.com/a/26224619/10462884
    # slen: length of the longest string; arrlen: length of the longest sublist
    slen = max(len(item) for sublist in test_list for item in sublist)
    arrlen = max(map(len, test_list))
    # Pad short sublists with None so every row has arrlen items
    test_array = np.array([tl+[None]*(arrlen-len(tl)) for tl in test_list], dtype='S'+str(slen))

    with h5py.File('example_nested.h5', 'w') as f:
        f.create_dataset('test3', data=test_array, compression='gzip')
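
To recover the original ragged lists, the padding has to be stripped again after reading. The sketch below is one way to do the round trip. It pads with empty strings rather than None, on the assumption that no real item is an empty string, which makes the padding easy to tell apart from data. The shortened `test_list`, the file name `example_nested_padded.h5`, and the temporary directory are illustrative choices.

```python
import os
import tempfile

import h5py
import numpy as np

test_list = [['a01', 'a02', 'a03'],
             ['a11', 'a12', 'a13', 'a14'],
             ['a21', 'a22']]

slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))

# Pad short rows with empty strings (assumption: no real item is empty).
padded = np.array([row + [''] * (arrlen - len(row)) for row in test_list],
                  dtype='S' + str(slen))

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example_nested_padded.h5')

with h5py.File(path, 'w') as f:
    f.create_dataset('test3', data=padded, compression='gzip')

with h5py.File(path, 'r') as f:
    stored = f['test3'][:]

# Drop the empty padding cells and decode bytes back to str.
restored = [[item.decode('ascii') for item in row if item] for row in stored]
```

If empty strings can occur in the real data, a different sentinel or a separate dataset holding the row lengths would be needed instead.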
    

    PyTables example
    PyTables supports ragged 2-d arrays as VLArrays (variable length). This avoids the complication of padding the "short" sublists with 'None'. You also don't have to determine the array length in advance, because the number of rows is not fixed when the VLArray is created (rows are appended after creation). Again, take care with the string size: the code uses the same slen calculation as above and passes it both to tb.StringAtom() and to the dtype= of each appended row.

    import tables as tb
    import numpy as np
    
    test_list = [['a01','a02','a03','a04','a05','a06'], 
                 ['a11','a12','a13','a14','a15','a16','a17'], 
                 ['a21','a22','a23','a24','a25','a26','a27','a28']]
       
    slen = max(len(item) for sublist in test_list for item in sublist)
    
    with tb.open_file('example_nested_tb.h5', 'w') as h5f:
        # One VLArray row per sublist; each item is a fixed-size string of slen bytes
        vlarray = h5f.create_vlarray('/', 'vla_test', tb.StringAtom(slen))
        for slist in test_list:
            arr = np.array(slist, dtype='S'+str(slen))
            vlarray.append(arr)

        # Read the rows back while the file is still open
        print('-->', vlarray.name)
        for row in vlarray:
            print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))
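
The PyTables example above does not request compression. A minimal sketch of how a compression filter could be attached, and how the rows could be read back as ordinary lists, is shown below. The zlib level 9 setting mirrors the gzip level from the question; the file name `example_nested_tb_zlib.h5` and the temporary directory are illustrative. As far as I know, HDF5 keeps variable-length data outside the chunks that filters act on, so the size reduction for a VLArray may be small; it is worth comparing file sizes on real data.

```python
import os
import tempfile

import numpy as np
import tables as tb

test_list = [['a01', 'a02', 'a03', 'a04', 'a05', 'a06'],
             ['a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17'],
             ['a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28']]

slen = max(len(item) for sublist in test_list for item in sublist)

# zlib at level 9 (assumed setting, chosen to match the question).
filters = tb.Filters(complevel=9, complib='zlib')

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example_nested_tb_zlib.h5')

with tb.open_file(path, 'w') as h5f:
    vlarray = h5f.create_vlarray('/', 'vla_test',
                                 atom=tb.StringAtom(slen),
                                 filters=filters)
    for slist in test_list:
        vlarray.append(np.array(slist, dtype='S' + str(slen)))

# Reopen read-only and rebuild the nested list of str.
with tb.open_file(path, 'r') as h5f:
    restored = [[item.decode('ascii') for item in row]
                for row in h5f.root.vla_test]
```

Because each row keeps its own length, no padding has to be removed on the way back.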