
Extracting Datasets from HDF5 File in Order Created


I have an HDF5 file I am trying to open with Python or MATLAB. The HDF5 file consists of several datasets, all in the root group, which were saved to the file in some order. I want to extract the datasets in the order they were written. I know that the write order is encoded in the HDF5 file, because when I open it with HDFView there is an "Object Ref" number associated with each dataset. These Object Ref IDs are lower for datasets that were written earlier and higher for datasets that were written later.

When I open the file with Python (h5py package), the datasets are extracted in alphabetical order. I can't figure out any way to extract the Object Ref I see in HDFView to process in Python. Is there any way to extract the datasets in the order they were written, in Python or MATLAB (or any other platform)?

This is the code I used in Python to get the datasets in alphabetical order:

import h5py

# file: path to the HDF5 file
with h5py.File(file, 'r') as f:
    for k in f.keys():
        print(k)

I'm looking for a way to do something like this:

with h5py.File(file, 'r') as f:
    keys = list(f.keys())
    object_refs = f.object_refs()  # pseudocode: no such h5py method
    indexes_in_sorted_order = object_refs.sorted_order()  # pseudocode
    for i in indexes_in_sorted_order:
        print(keys[i])


Solution

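  • @Homer512 is correct: h5py doesn't have an API to get that value. That said, you might be able to use each dataset's "offset" value (the byte address of its data in the file). I did some limited testing with datasets that are NOT created in alphabetical order, and the offset values appear to increase in order of creation. To get the offset you have to use the low-level API, through the dataset's DatasetID object (dset.id). Treat this as a heuristic: get_offset() returns None for chunked or compact datasets, and HDF5 does not guarantee that offsets follow creation order.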

    Here is an example that creates 6 datasets in non-alphabetical order. After creating them, it loops over the datasets, builds a dictionary of name: offset pairs, then reorders the dictionary by offset value. Finally, it loops over the names in the sorted dictionary to get the datasets in offset order. (You could also create a sorted list of the dataset names if you're not interested in the offset value.)

    Note: If you are going to do this frequently, I suggest adding creation time as a dataset attribute.

    See code below:

    import h5py
    import numpy as np

    ds_names = ['alpha', 'zebra', 'bravo', 'yankee', 'charlie', 'xray']
    cnt = 1
    # Create the datasets in a deliberately non-alphabetical order
    with h5py.File('SO_75624797.h5', 'w') as h5f:
        for name in ds_names:
            h5f.create_dataset(name, data=np.arange(cnt, cnt+10))
            cnt += 10

    offset_dict = {}
    with h5py.File('SO_75624797.h5', 'r') as h5f:
        # Iteration is alphabetical; record each dataset's byte offset
        for dset in h5f:
            offset = h5f[dset].id.get_offset()  # low-level DatasetID call
            print(f"for dset: {dset}, Offset: {offset}")
            offset_dict[dset] = offset

        # Reorder the dictionary by offset (ascending)
        offset_dict = dict(sorted(offset_dict.items(), key=lambda item: item[1]))

        print('')
        for dset, offset in offset_dict.items():
            print(f"for dset: {dset}, Offset: {offset}")
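
The note above about storing creation information as a dataset attribute could look roughly like the sketch below. It only helps for files you write yourself. The file name and the attribute names 'creation_index' and 'creation_time' are made up for illustration, and the integer counter is an added assumption (the note only mentions creation time) so that two datasets written within the same clock tick still sort unambiguously.

```python
import time

import h5py
import numpy as np

ds_names = ['alpha', 'zebra', 'bravo', 'yankee', 'charlie', 'xray']

# Write: tag each dataset with a running counter and a timestamp.
# 'creation_index' and 'creation_time' are hypothetical attribute names.
with h5py.File('creation_order_demo.h5', 'w') as h5f:
    for seq, name in enumerate(ds_names):
        dset = h5f.create_dataset(name, data=np.arange(10))
        dset.attrs['creation_index'] = seq
        dset.attrs['creation_time'] = time.time()

# Read: sort the dataset names by the stored counter instead of by name.
with h5py.File('creation_order_demo.h5', 'r') as h5f:
    ordered = sorted(h5f.keys(),
                     key=lambda name: h5f[name].attrs['creation_index'])

print(ordered)
```

Unlike the offset approach, this does not depend on how HDF5 lays out data in the file, so it should also work for chunked or compressed datasets.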