python, hdf5, h5py

Where is the h5py performance bottleneck?


I have an HDF5 file with 100 "events". Each event contains a variable number of groups called "traces" (roughly 180 per event), and each trace holds 6 datasets: arrays of 32-bit floats, each ~1000 cells long (the length varies slightly from event to event, but is constant within an event). The file was generated with default h5py settings (so no chunking or compression unless h5py applies it on its own).

The readout is not fast: it is ~6 times slower than reading the same data from CERN ROOT TTrees. I know that HDF5 is far from the fastest format on the market, but I would be grateful if you could tell me where the speed is lost.

To read the arrays in traces I do:

    import h5py

    data = h5py.File("events.h5", "r")  # illustrative filename

    d0keys = data["Run_0"].keys()
    for key_1 in d0keys:
        if "Event_" in key_1:
            d1 = data["Run_0"][key_1]
            d1keys = d1.keys()
            for key_2 in d1keys:
                if "Traces_" in key_2:
                    d2 = d1[key_2]
                    v1, v2, v3, v4, v5, v6 = d2['SimSignal_X'][0], d2['SimSignal_Y'][0], d2['SimSignal_Z'][0], d2['SimEfield_X'][0], d2['SimEfield_Y'][0], d2['SimEfield_Z'][0]

Line profiler shows that ~97% of the time is spent in the last line. Now, there are two issues:

  1. There seems to be no difference between reading cell [0] and reading all ~1000 cells with [:]. I understand that h5py should be able to read just a chunk of data from the disk, so why is there no difference?
  2. Reading 100 events from an HDD (Linux, ext4) takes ~30 s with h5py and ~5 s with ROOT. The 100 events are roughly 430 MB, which gives a readout speed of ~14 MB/s for HDF5 versus ~86 MB/s for ROOT. Both are slow, but ROOT comes much closer to the raw read speed I would expect from a ~4-year-old laptop HDD.

So where does h5py lose its speed? I would expect the pure readout to run at close to the HDD's raw speed. Is the bottleneck:

  1. Dereferencing the HDF5 address of each dataset (which ROOT does not need to do)?
  2. Allocating memory in Python?
  3. Something else?

I would be grateful for some clues.


Solution

  • There are a lot of HDF5 I/O issues to consider. I will try to cover each.

    From my tests, time spent doing I/O is primarily a function of the number of reads/writes and not how much data (in MB) you read/write. Read this SO post for more details: pytables writes much faster than h5py. Why? Note: it shows I/O performance for a fixed amount of data with different I/O write sizes for both h5py and PyTables. Based on this, it makes sense that most of the time is spent in the last line -- that's where you are reading the data from disk to memory as NumPy arrays (v1, v2, v3, v4, v5, v6).

    Regarding your questions:

    1. There is no difference between reading d2['SimSignal_X'][0] and d2['SimSignal_X'][:] because both read the entire dataset into memory (all ~1000 dataset values). If you only want to read part of the data, you need to use slice notation. For example, d2['SimSignal_X'][0:100] only reads the first 100 values (assuming d2['SimSignal_X'] has a single axis -- shape=(1000,)). Note: reading a slice will reduce the required memory, but won't improve I/O read time. (In fact, reading slices will probably increase read time.)
    2. I am not familiar with CERN ROOT, so I can't comment on the performance of h5py vs ROOT. If you want to use Python, here are several things to consider:
      • You are reading the HDF5 data into memory (as NumPy arrays). You don't have to do that. Instead of creating arrays, you can create h5py dataset objects, which defers the I/O. You then use the objects "as if" they were np.arrays. The only change is the last line of your code -- like this: v1, v2, v3, v4, v5, v6 = d2['SimSignal_X'], d2['SimSignal_Y'], d2['SimSignal_Z'], d2['SimEfield_X'], d2['SimEfield_Y'], d2['SimEfield_Z']. Note that no slice notation is used ([0] or [:]). There is a short sketch of this after the list.
      • I have found PyTables to be faster than h5py, and your code can easily be converted to read with PyTables (see the second sketch after this list).
      • HDF5 can use "chunking" to improve I/O performance. You did not say whether you are using it. It might help (hard to say, since your datasets aren't very large). Chunking has to be defined when the HDF5 file is created, so it might not be useful in your case (a write-side sketch follows the list).
      • You also did not say whether a compression filter was used when the data was written. Compression reduces the on-disk file size, but has the side effect of reducing I/O performance (increasing read times). If you don't know, check whether compression was enabled at file creation (the last sketch below shows how).
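
    A minimal sketch of the dataset-object approach, reusing the layout from your question ("events.h5" is a hypothetical filename):

        import h5py

        with h5py.File("events.h5", "r") as data:
            for key_1 in data["Run_0"]:
                if "Event_" in key_1:
                    d1 = data["Run_0"][key_1]
                    for key_2 in d1:
                        if "Traces_" in key_2:
                            d2 = d1[key_2]
                            # h5py Dataset objects -- no data is read from disk yet
                            v1, v2, v3 = d2['SimSignal_X'], d2['SimSignal_Y'], d2['SimSignal_Z']
                            v4, v5, v6 = d2['SimEfield_X'], d2['SimEfield_Y'], d2['SimEfield_Z']
                            # disk I/O happens only when you slice, e.g. v1[:] or v1[0:100]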
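
    And a minimal PyTables sketch under the same assumptions (again, "events.h5" is hypothetical):

        import tables

        with tables.open_file("events.h5", "r") as h5:
            # walk_nodes() yields every Array node below "/" without reading its data
            for node in h5.walk_nodes("/", classname="Array"):
                if "SimSignal" in node.name or "SimEfield" in node.name:
                    values = node.read()  # reads the whole array into NumPy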
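
    Chunking is a write-side setting. A hypothetical example of what the writer would need to do (create_dataset creates the intermediate groups automatically):

        import numpy as np
        import h5py

        trace = np.zeros(1000, dtype=np.float32)  # dummy data
        with h5py.File("events_chunked.h5", "w") as f:
            # one chunk per trace array; compression could be added with
            # compression="gzip", at the cost of read speed
            f.create_dataset("Run_0/Event_0/Traces_0/SimSignal_X",
                             data=trace, chunks=(1000,))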
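
    To check how your existing file was written, you can inspect each dataset's storage settings: dataset.compression is None when no filter was used, and dataset.chunks is None for contiguous (unchunked) storage:

        import h5py

        def show(name, obj):
            if isinstance(obj, h5py.Dataset):
                print(name, "chunks:", obj.chunks, "compression:", obj.compression)

        with h5py.File("events.h5", "r") as f:
            f.visititems(show)  # visits every group and dataset in the file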