I have an HDF5 file with 100 "events". Each event contains a variable number of groups called "traces" (roughly 180), and each trace holds 6 datasets: arrays of 32-bit floats, each ~1000 cells long (the length varies slightly from event to event, but is constant within an event). The file was generated with default h5py settings (so no chunking or compression unless h5py applies them on its own).
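For reference, here is a stripped-down sketch of how a file with this layout could be produced (the real generator is more involved; the leading axis of length 1 is my way of matching the [0] indexing in the read code below):

```python
import h5py
import numpy as np

# Stripped-down sketch of the file layout (default h5py settings, no chunking/compression).
with h5py.File("sim.h5", "w") as f:
    run = f.create_group("Run_0")
    for ev in range(100):                        # ~100 events
        event = run.create_group(f"Event_{ev}")
        for tr in range(180):                    # ~180 traces per event
            trace = event.create_group(f"Traces_{tr}")
            for name in ("SimSignal_X", "SimSignal_Y", "SimSignal_Z",
                         "SimEfield_X", "SimEfield_Y", "SimEfield_Z"):
                # ~1000 float32 values per dataset
                trace.create_dataset(name, data=np.zeros((1, 1000), dtype=np.float32))
```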
The readout is not fast: it is ~6 times slower than reading the same data from CERN ROOT TTrees. I know that HDF5 is far from the fastest format on the market, but I would be grateful if you could tell me where the speed is lost.
To read the arrays in the traces I do:
```python
d0keys = data["Run_0"].keys()
for key_1 in d0keys:
    if "Event_" in key_1:
        d1 = data["Run_0"][key_1]
        d1keys = d1.keys()
        for key_2 in d1keys:
            if "Traces_" in key_2:
                d2 = d1[key_2]
                v1, v2, v3, v4, v5, v6 = (d2['SimSignal_X'][0], d2['SimSignal_Y'][0],
                                          d2['SimSignal_Z'][0], d2['SimEfield_X'][0],
                                          d2['SimEfield_Y'][0], d2['SimEfield_Z'][0])
```
Line profiler shows that ~97% of the time is spent in the last line, which raises two questions. Where does h5py lose its speed? I would naively expect the pure readout to be limited only by the HDD speed, so what is the bottleneck? I would be grateful for some clues.
There are a lot of HDF5 I/O issues to consider. I will try to cover each.
From my tests, time spent doing I/O is primarily a function of the number of reads/writes, not of how much data (in MB) you read/write. For more details, read this SO post: pytables writes much faster than h5py. Why? Note: it shows I/O performance for a fixed amount of data with different I/O write sizes, for both h5py and PyTables. Based on this, it makes sense that most of the time is spent in the last line -- that's where you are reading the data from disk into memory as NumPy arrays (v1, v2, v3, v4, v5, v6).
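To make the read-count effect concrete, here is a minimal timing sketch (the file name, dataset name, and sizes are invented for illustration): it moves the same 4 MB once as a single read call and once as 1000 smaller read calls.

```python
import time
import h5py
import numpy as np

# Build a throwaway test file with one contiguous float32 dataset.
with h5py.File("io_test.h5", "w") as f:
    f.create_dataset("d", data=np.random.rand(1_000_000).astype(np.float32))

with h5py.File("io_test.h5", "r") as f:
    dset = f["d"]

    t0 = time.perf_counter()
    one_call = dset[:]                      # 1 read call for all 1M values
    t1 = time.perf_counter()

    many_calls = [dset[i:i + 1000]          # 1000 read calls, same bytes in total
                  for i in range(0, len(dset), 1000)]
    t2 = time.perf_counter()

print(f"1 read call:     {t1 - t0:.4f} s")
print(f"1000 read calls: {t2 - t1:.4f} s")
```

Per the point above, expect the second time to be much larger, even though the same amount of data ends up in memory.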
Regarding your questions:

There is no difference between d2['SimSignal_X'][0] and d2['SimSignal_X'][:]. Both read the entire dataset into memory (all ~1000 dataset values). If you only want to read a slice of the data, you need to use slice notation. For example, d2['SimSignal_X'][0:100] only reads the first 100 values (this assumes d2['SimSignal_X'] only has a single axis -- shape=(1000,)).
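Concretely, a minimal sketch of the two access patterns (assuming the single-axis shape above, with d2 taken from your loop):

```python
dset = d2['SimSignal_X']   # h5py dataset object; no data read yet

full = dset[:]             # reads all ~1000 values into a NumPy array
part = dset[0:100]         # reads only the first 100 values into memory
```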
Note: reading a slice will reduce the required memory, but won't improve I/O read time. (In fact, reading slices will probably increase read time.)

Also, you don't have to read the data into NumPy arrays at all. You can reference the h5py dataset objects directly:

```python
v1, v2, v3, v4, v5, v6 = (d2['SimSignal_X'], d2['SimSignal_Y'], d2['SimSignal_Z'],
                          d2['SimEfield_X'], d2['SimEfield_Y'], d2['SimEfield_Z'])
```

Note how the slice notation is not used ([0] or [:]). This creates h5py dataset objects instead of NumPy arrays, so no data is read from disk at this point.
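Those dataset objects can then be sliced later, only when (and if) you actually need the values. A minimal sketch of that pattern (reusing the loop variables from your code; note the file must still be open when you finally slice):

```python
signals = []
for key_2 in d1keys:
    if "Traces_" in key_2:
        d2 = d1[key_2]
        signals.append(d2['SimSignal_X'])   # store the dataset object; no disk I/O yet

# The actual reads happen only here, one trace at a time:
for dset in signals:
    values = dset[:]   # now the trace's ~1000 values are read into a NumPy array
```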