
Search an HDF5 file for a specific group or dataset name to get its path


I have an HDF5 file with several datasets in groups and sub-groups. I want to get the path to a specific group or dataset by its name.

A good way to do it is shown in the h5py documentation: https://docs.h5py.org/en/stable/high/group.html

import h5py

with h5py.File('File.hdf5','r') as hf:
    def find_Name (hf):
        # .visit() passes each member's path here, so this hf is a string, not the file
        if 'Name' in hf:
            return hf

    hf.visit(find_Name)

>>>'Group/subGroup/Name'

The problem with this solution is that I cannot change the "Name" of the dataset/group with each call of hf.visit(find_Name).

How can I pass a different search string to the function on each call?

The following did not work:

with h5py.File('File.hdf5','r') as hf: 
    def find_Name (hf,Name):
        if Name in hf:
            return hf
    
    Name = 'NameOfDataset'
    hf.visit(find_Name(hf,Name))

Thank you for your support!


Solution

  • Your second attempt fails for two reasons. First, hf.visit(find_Name(hf,Name)) calls find_Name immediately and passes its return value to .visit(), rather than passing the function itself. Second, the h5py visitor methods (.visit() and .visititems()) call the function with a fixed argument list (the member's path for .visit(), the path and the object for .visititems()), so you can't simply add Name to the function's argument list. However, writing a function that mimics their recursive behavior isn't complicated: you just need to call the same function recursively to descend into any groups you find. (PyTables has functions that support this, but I will save that for a different answer if you are interested.) Take a look at this code:

    import h5py

    with h5py.File('File.hdf5','r') as hf:
        
        def check_Name(Name,grp,prefix=''):
            for obj_name, obj in grp.items():
                path = f'{prefix}/{obj_name}'
                if Name == obj_name:
                    return path
                elif isinstance(obj, h5py.Group): # test for group (go down)
                    gpath = check_Name(Name, obj, path)
                    if gpath:
                        return gpath
    
        Name = 'Name'
        path = check_Name(Name,hf)
        if path:
            print(f'{Name} Found: {path}')
        else:    
            print(f'{Name} not Found')
           
        Name = 'NameOfDataset'
        path = check_Name(Name,hf)
        if path:
            print(f'{Name} Found: {path}')
        else:    
            print(f'{Name} not Found')
    

    Output for your schema is:

    Name Found: /Group/subGroup/Name
    NameOfDataset not Found
    
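If you would rather keep using .visit(), one possible workaround is to bind the search string before handing the callback over, for example with functools.partial from the standard library. The following is only a sketch under assumptions: find_by_name is a hypothetical helper, and the in-memory demo.h5 file with a single dataset at Group/subGroup/Name is invented here so the snippet runs on its own. Unlike the substring test in the question, it compares only the last path component.

```python
from functools import partial

import h5py


def find_by_name(target, path):
    """Visitor callback (hypothetical helper): match on the last path component."""
    # .visit() passes each member's path relative to the starting group,
    # e.g. 'Group/subGroup/Name', so compare only the final component.
    if path.split('/')[-1] == target:
        return path  # any non-None return value stops the traversal
    return None


# Assumed demo layout: an in-memory file, so nothing is written to disk.
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as hf:
    hf.create_dataset('Group/subGroup/Name', data=[1, 2, 3])

    # partial() fixes the first argument; .visit() then supplies the path.
    found = hf.visit(partial(find_by_name, 'Name'))
    missing = hf.visit(partial(find_by_name, 'NameOfDataset'))

print(found)
print(missing)
```

A lambda or a nested function that closes over the search string should work the same way; partial just keeps the callback reusable. Note that .visit() reports paths without a leading slash.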

    Note: This returns only the first occurrence of the input Name. You will need to convert it to a generator function if you need to find multiple occurrences. (FYI, .visit() has the same limitation: it stops as soon as the function returns a value other than None.)
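
A generator variant of the recursive search might look roughly like the sketch below. It is an illustration, not a tested drop-in: iter_paths is a hypothetical name, and the in-memory demo.h5 file with two objects called Name is assumed here purely to show more than one match.

```python
import h5py


def iter_paths(name, grp, prefix=''):
    """Yield the path of every group or dataset called `name` below grp."""
    for obj_name, obj in grp.items():
        path = f'{prefix}/{obj_name}'
        if obj_name == name:
            yield path
        if isinstance(obj, h5py.Group):
            # Keep descending even after a match: a group called `name`
            # may itself contain further matches.
            yield from iter_paths(name, obj, path)


# Assumed demo layout: an in-memory file, so nothing is written to disk.
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as hf:
    hf.create_dataset('Group/subGroup/Name', data=[1, 2, 3])
    hf.create_dataset('Other/Name', data=[4, 5, 6])

    # Consume the generator while the file is still open.
    matches = list(iter_paths('Name', hf))

print(matches)
```

Because the generator is lazy, next(iter_paths(Name, hf), None) would give back just the first match (or None), which mirrors the behavior of check_Name above.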