I need to store a list/array of strings in an HDF5 file using h5py. These strings are variable length. Following the examples I find online, I have a script that works.
import h5py

h5File=h5py.File('outfile.h5','w')
data=['this','is','a','sentence']

dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words',(len(data),1),dtype=dt)
for i,word in enumerate(data):
    dset[i] = word

h5File.flush()
h5File.close()
However, when data gets very large, the write takes a long time because it loops over each entry and inserts it into the file one at a time. I thought I could do it all in one line, just as I would with ints or floats, but the following script fails. Note that I added some code to verify that int works.
import h5py
h5File=h5py.File('outfile.h5','w')
data_numbers = [0, 1, 2, 3, 4]
data = ['this','is','a','sentence']
dt = h5py.special_dtype(vlen=str)
dset_num = h5File.create_dataset('numbers',(len(data_numbers),1),dtype=int,data=data_numbers)
print("Created the dataset with numbers!\n")
dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
print("Created the dataset with strings!\n")
h5File.flush()
h5File.close()
That script gives the following output.
Created the dataset with numbers!
Traceback (most recent call last):
  File "write_strings_to_HDF5_file.py", line 32, in <module>
    dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
  File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 170, in make_new_dset
    dset_id.write(h5s.ALL, h5s.ALL, data)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 211, in h5py.h5d.DatasetID.write
  File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1713, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U8')
I've read the documentation about UTF-8 encoding and tried a number of variations on the above syntax, but I seem to be missing some key point. Maybe it can't be done?
Thanks to anyone who has a suggestion!
If anyone wants to see the slowdown on the example that works, here's a test case.
import h5py

h5File=h5py.File('outfile.h5','w')

sentence=['this','is','a','sentence']
data = []
for i in range(10000):
    data += sentence
print(len(data))

dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words',(len(data),1),dtype=dt)
for i,word in enumerate(data):
    dset[i] = word

h5File.flush()
h5File.close()
Writing data 1 row at a time is the slowest way to write to an HDF5 file. You won't notice the performance issue when you write 100 rows, but you will see it as the number of rows increases. There is another answer that discusses this issue. See this: pytables writes much faster than h5py. Why? (Note: I am NOT suggesting you use PyTables. The linked answer shows performance for both h5py and PyTables.) As you can see, it takes a lot longer to write the same amount of data when writing a lot of small chunks.
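If you want to see the gap on your own machine, here is a minimal timing sketch comparing the two patterns side by side. The file names, repeat count, and the fixed width 'S8' are arbitrary choices for this demo, not anything required by h5py:

```python
import time
import h5py
import numpy as np

data = ['this', 'is', 'a', 'sentence'] * 1_000
dt = h5py.special_dtype(vlen=str)

# Pattern 1: write one row at a time (the slow approach from the question)
with h5py.File('row_by_row.h5', 'w') as f:
    dset = f.create_dataset('words', (len(data), 1), dtype=dt)
    t0 = time.perf_counter()
    for i, word in enumerate(data):
        dset[i] = word
    t_loop = time.perf_counter() - t0

# Pattern 2: convert to a fixed-width byte-string array and write in one call
with h5py.File('bulk.h5', 'w') as f:
    arr = np.array(data, dtype='S8')
    t0 = time.perf_counter()
    f.create_dataset('words', data=arr, dtype=dt)
    t_bulk = time.perf_counter() - t0

print(f'row-by-row: {t_loop:.4f}s  bulk: {t_bulk:.4f}s')
```

The absolute numbers will vary by system and HDF5 version, but the bulk write should come out far ahead even at 4,000 rows.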
To improve performance, you need to write more data each time. Since you have all the data loaded in the list data, you can do it in one shot. It will be nearly instantaneous for 10,000 rows. The answer referenced in the comments touches on this technique (creating a np.array() from the list data). However, it works from small lists (1/row), so it's not exactly the same. You have to take care when you create the array: you can't use NumPy's default Unicode dtype -- it isn't supported by h5py. Instead, you need dtype='S#'.
The code below shows how to convert your list of strings to a np.array() of strings. Also, I highly recommend you use Python's with/as context manager to open the file. This avoids situations where the file is accidentally left open after an unexpected exit (due to a crash or logic error).
Code below:
import h5py
import numpy as np

sentence=['this','is','a','sentence']
data = []
for i in range(10_000):
    data += sentence
print(len(data))

longest_word=len(max(data, key=len))
print('longest_word=',longest_word)

dt = h5py.special_dtype(vlen=str)
arr = np.array(data,dtype='S'+str(longest_word))

with h5py.File('outfile.h5','w') as h5File:
    dset = h5File.create_dataset('words',data=arr,dtype=dt)
    print(dset.shape, dset.dtype)
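One follow-up worth knowing: when you read a variable-length string dataset back, newer h5py versions (3.x) return the entries as bytes objects rather than Python strings, so you typically decode them yourself. A small self-contained sketch (the file name readback_demo.h5 is just for this demo; the isinstance check keeps it working on older h5py versions that return str directly):

```python
import h5py
import numpy as np

data = ['this', 'is', 'a', 'sentence']
dt = h5py.special_dtype(vlen=str)

# write the strings in one shot, as in the answer above
with h5py.File('readback_demo.h5', 'w') as f:
    f.create_dataset('words', data=np.array(data, dtype='S8'), dtype=dt)

# read them back; h5py 3.x hands back bytes, so decode to get str
with h5py.File('readback_demo.h5', 'r') as f:
    words = f['words'][:]

decoded = [w.decode('utf-8') if isinstance(w, bytes) else w for w in words]
print(decoded)
```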