python, hdf5, h5py

Slow data retrieval in h5py dataset of key-value strings


I have an h5py file laid out as root -> group1 -> about a million key/value pairs.
Retrieving any number of datasets, even just one, takes an extremely long time (~10 seconds). Can I insert them differently to avoid this? (I assume the cache is too large for my use case, but the default size is only 1 MB.)
The two scripts below reproduce the behavior:

script A

import json
import h5py
from tqdm import tqdm

hdf5 = h5py.File(path_to_h5py, libver='latest', mode='a')
hdf5_dataset = hdf5.create_group(name_of_dataset)
for key, val in tqdm(dataset.items()):
    hdf5_dataset.create_dataset(json.dumps(key), data=json.dumps(val))

script B

f = h5py.File(path_to_h5py,'r')
data = f[name_of_dataset]
key_example = next(iter(data))  # this takes ~10 seconds

Solution

  • HDF5 isn't designed to be used as a key/value store the way a Python dictionary is; its datasets are more like NumPy arrays. I don't know what you are ultimately trying to do, but there is a far simpler iterator for Script B. Try this:

    h5f = h5py.File(path_to_h5py,'r')
    data = h5f[name_of_dataset]
    for key_example in data:
        print (key_example)
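
Since HDF5 datasets behave more like NumPy arrays, one layout worth trying is to store all pairs in two parallel 1-D string datasets instead of creating one tiny dataset per key. The following is only a minimal sketch under assumptions: the sample `dataset` dict, the file name `kv_example.h5`, and the dataset names `keys` and `vals` are hypothetical, and `h5py.string_dtype` requires h5py 2.10 or later.

```python
import json
import os
import tempfile

import h5py

# Hypothetical sample data standing in for the question's `dataset` dict.
dataset = {f"key{i}": {"value": i} for i in range(1000)}

path = os.path.join(tempfile.mkdtemp(), "kv_example.h5")
str_dt = h5py.string_dtype(encoding="utf-8")  # needs h5py >= 2.10

# Write: two 1-D datasets instead of one tiny dataset per key.
with h5py.File(path, "w") as h5w:
    grp = h5w.create_group("group1")
    grp.create_dataset("keys", data=[json.dumps(k) for k in dataset], dtype=str_dt)
    grp.create_dataset("vals", data=[json.dumps(v) for v in dataset.values()], dtype=str_dt)


def as_text(item):
    # h5py 3.x returns bytes for variable-length strings, h5py 2.x returns str.
    return item.decode("utf-8") if isinstance(item, bytes) else item


# Read: load the keys once, build an in-memory index, then fetch single values.
with h5py.File(path, "r") as h5r:
    grp = h5r["group1"]
    index = {json.loads(as_text(k)): i for i, k in enumerate(grp["keys"][:])}
    value = json.loads(as_text(grp["vals"][index["key42"]]))

print(value)
```

The key index lives in memory, so this only makes sense if the keys fit in RAM. Whether it is faster than the original layout for a million entries would need to be measured on the real data.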
    

    Simple test added on 2020-04-25 to check I/O performance:

    import h5py
    import time
    
    upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    lower = 'abcdefghijklmnopqrstuvwxyz'
    nums = '0123456789'
    
    with h5py.File('SO_61417130.h5','w') as h5w:
    
        nrows = 16
        nrpts = 100
    
        #vstr_dt = h5py.string_dtype(encoding='utf-8') # for h5py 2.10.0 and later
        vstr_dt = h5py.special_dtype(vlen=str)   # for h5py 2.9.0 (still accepted by later versions)
        vstr_ds = h5w.create_dataset('testvstrs', (nrpts*nrows,1), dtype=vstr_dt )
        print (vstr_ds.dtype, ":", vstr_ds.shape)    
    
        rcnt = 0
        for cnt1 in range(nrpts) :
            for cnt2 in range(nrows) :
                vstr_ds[rcnt]= ((cnt2+1)*(upper+lower+nums))
                rcnt +=1
    
        print (vstr_ds.dtype, ":", vstr_ds.shape)    
        print ('done writing')
    
        start = time.perf_counter()  # time.clock() was removed in Python 3.8
        for cnt in range(-nrows, 0, 1):
            find_str = vstr_ds[cnt]
            print(len(find_str[0]))
    
        print('Elapsed time =', (time.perf_counter() - start))
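
To check whether the delay comes from the layout in the question (one tiny dataset per key), the same kind of timing can be applied to a scaled-down copy of that layout. This is a rough sketch, not a benchmark: `n_items`, the file name `many_datasets.h5`, and the key and value strings are assumptions, and the count is kept small so the script finishes quickly. Increase `n_items` to see how the time to fetch the first key grows.

```python
import os
import tempfile
import time

import h5py

n_items = 2000  # assumed small size; the question used about a million
path = os.path.join(tempfile.mkdtemp(), "many_datasets.h5")

# Layout from the question: one tiny string dataset per key.
with h5py.File(path, "w", libver="latest") as h5w:
    grp = h5w.create_group("group1")
    for i in range(n_items):
        grp.create_dataset(f"key{i}", data=f"value{i}")

# Time how long it takes to obtain the first key of the group.
with h5py.File(path, "r") as h5r:
    grp = h5r["group1"]
    start = time.perf_counter()
    first_key = next(iter(grp))
    elapsed = time.perf_counter() - start
    n_found = len(grp)

print(f"first key {first_key!r} of {n_found} found in {elapsed:.4f} s")
```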