python · hdf5 · h5py

Filling a hierarchical HDF5 dataset in a loop


What I am trying to do is take an existing HDF5 hierarchical dataset, select some of the elements, do an operation on them, and create a transformed file.

To read the file and do the operations, I am using code similar to this:

import h5py

with h5py.File(path, 'r') as fDFT:  # Loads the dataset and closes the file after the loop
    mol_ids = list(fDFT.keys())  # Main identifier of the data
    for molid in mol_ids:  # Iterate over each molecule
        conf_ids = list(fDFT[molid].keys())
        for confid in conf_ids:  # Iterate over the configurations
            if 'opt' in str(confid):  # Only select specific configurations
                res1 = fDFT[molid][confid]['mPOL'][:]  # Read an entry from the file
                res2 = convert(fDFT[molid][confid]['atNUM'][:])  # Transform another entry

I can create the output HDF5 file and write the first identifier easily, since mol_ids is already saved in a variable, by adding

hdf5_output = h5py.File('restricted.hdf5','w')
hdf5_output['molid'] = mol_ids

but this creates a single label 'molid', which is not what I want (the labels should be the values stored in mol_ids), and I am not sure how to extend this approach to the nested level anyway.
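
For illustration, inspecting hdf5_output after that assignment shows one dataset literally named 'molid' rather than one group per identifier:

print(list(hdf5_output.keys()))  # ['molid'] -- the literal label, not the values
print(hdf5_output['molid'][()])  # the identifier strings stored as data, not as groups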

Is there a nice way to get an HDF5 file that follows the hierarchy of the for loop, that is, one that creates a data structure that can be accessed as hdf5_output[molid][confid]['mPOL'], for example?

I've tried the create_group() function, but it did not let me create groups iteratively.


Solution

  • I think you are confusing the string 'molid' with the variable name molid. Also, you need to create a group with that name; likely what you want is hdf5_output.create_group(molid) (group name w/out quotes).
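
    A minimal sketch of the difference (the 'mol_001' value is hypothetical):

        molid = 'mol_001'                  # variable holding a name read from the file
        hdf5_output.create_group('molid')  # creates a group literally named 'molid'
        hdf5_output.create_group(molid)    # creates a group named 'mol_001'

    As I understand your 2nd question, it would look something like this.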

    # Open the source file for reading and
    # create the restricted file for writing.
    # Both files are closed when exiting the with/as context manager.
    with h5py.File(path, 'r') as fDFT, \
         h5py.File('restricted.hdf5', 'w') as hdf5_output:

        for molid in fDFT:  # Iterate over root-level objects
            # Create a group named molid in the 2nd H5 file:
            hdf5_output.create_group(molid)
            for confid in fDFT[molid]:  # Iterate over the configuration groups
                if 'opt' in confid:  # Select configurations with 'opt' in the name
                    # Create the nested group before writing into it
                    # (hdf5_output[molid][confid] does not exist yet):
                    conf_grp = hdf5_output[molid].create_group(confid)
                    res1 = fDFT[molid][confid]['mPOL'][()]  # Read an entry from the file
                    res2 = convert(fDFT[molid][confid]['atNUM'][()])  # Transform another entry

                    # The following is my guess at the 2nd part
                    # (writing the res1/res2 data to the new file):
                    conf_grp['mPOL'] = res1
                    conf_grp['atNUM_converted'] = res2

                    # Alternately, use the h5py group.copy() method to copy the
                    # 'mPOL' dataset as-is (won't work for res2, which is transformed).
                    # It is the easiest/preferred way to copy datasets unchanged:
                    # fDFT.copy(fDFT[molid][confid]['mPOL'], conf_grp)

                    # Or skip res1/res2 creation and write directly:
                    # conf_grp['mPOL'] = fDFT[molid][confid]['mPOL'][()]
                    # conf_grp['atNUM_converted'] = convert(fDFT[molid][confid]['atNUM'][()])
    

    Notes:

    • I took the liberty of simplifying the code by removing extraneous lines.
    • I combined the file open operations into a single with/as context manager.
    • You don't need .keys() to get the object names. Similar to dictionary syntax, a loop on an H5 object will yield the keys (group and/or dataset names). Also, you don't need to convert to a list.
    • The h5py group.copy() method is the simplest way to copy an entire dataset as-is. By default it uses the source dataset's compression and chunking parameters and copies attributes. I included a (commented-out) example for the 'mPOL' dataset; see the h5py documentation for Group.copy() for a complete description.
    • The preferred way to extract an entire dataset as a NumPy array is ds_name[()] instead of ds_name[:]. Ref: h5py Reading & writing data. A short snippet illustrating these last three notes follows.
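
    A minimal sketch pulling those notes together (the file names are hypothetical):

        import h5py

        with h5py.File('example.hdf5', 'r') as fin, h5py.File('copy.hdf5', 'w') as fout:
            for name in fin:                 # iterates over keys; no .keys() or list() needed
                obj = fin[name]
                if isinstance(obj, h5py.Dataset):
                    fin.copy(obj, fout)      # copies data, chunking/compression, and attributes
                    arr = obj[()]            # whole dataset extracted as a NumPy array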