Tags: python, hdf5, h5py, hdf

How to split data among multiple groups in an HDF5 file?


I have some data which looks something like this:

Generated by trjconv : P/L=1/400 t=   0.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
11210LI     aLI11210  61.61  19.15   3.25
11211LI     aLI11211  69.99  64.64   3.17
11212LI     aLI11212  70.73  11.64   3.38
11213LI     aLI11213  62.67  16.16   3.44
11214LI     aLI11214   3.22   9.76   3.39
  61.42836  61.42836   8.47704
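
Each particle row is fixed-width, so it can be split with struct. As a quick illustration (using the same format string as in my code below):

import struct

# Illustration only: the format string from the code below splits one
# fixed-width particle row into six fields.
line = "11210LI     aLI11210  61.61  19.15   3.25"
fields = struct.Struct('7s 8s 5s 7s 7s 7s').unpack_from(line.encode('utf-8'))
# fields -> (b'11210LI', b'     aLI', b'11210', b'  61.61', b'  19.15', b'   3.25')
x, y, z = (float(f) for f in fields[3:])   # coordinates are the last three fields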

I have managed to write the data into the required groups, except for the last line. I want to write this line to a group /particles/box. This specific line is repeated in every frame, as you can see in the data file here. So far the code has been written in such a way that it simply ignores this line. I tried a few approaches but get the following error:

ValueError: Shape tuple is incompatible with data 

The last line is time dependent, i.e., it fluctuates a bit with every time frame. I want this data to be linked with the step and time datasets that are already defined (e.g., /particles/lipids/positions/step). Here is the code:

import struct
import numpy as np
import h5py
import re

# First part: convert the .gro file -> .h5.
csv_file = 'com'

fmtstring = '7s 8s 5s 7s 7s 7s'
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from

# Format for footer
fmtstring1 = '1s 1s 5s 7s 7s 7s'
fieldstruct1 = struct.Struct(fmtstring1)
parse1 = fieldstruct1.unpack_from

with open(csv_file, 'r') as f, \
    h5py.File('xaa_trial.h5', 'w') as hdf:
    # open group for position data
    ## Particles group with the attributes
    particles_grp = hdf.require_group('particles/lipids/positions')
    box_grp = particles_grp.create_group('box')
    dim_grp = box_grp.create_group('dimension')
    dim_grp.attrs['dimension'] = 3
    bound_grp = box_grp.create_group('boundary')
    bound_grp.attrs['boundary'] = ['periodic', 'periodic', 'periodic']
    edge_grp = box_grp.create_group('edges')
    edge_ds_time = edge_grp.create_dataset('time', dtype='f', shape=(0,), maxshape=(None,), compression='gzip', shuffle=True)
    edge_ds_step = edge_grp.create_dataset('step', dtype=np.uint64, shape=(0,), maxshape=(None,), compression='gzip', shuffle=True)
    edge_ds_value = None
    ## H5MD group with the attributes
    #hdf.attrs['version'] = 1.0 # global attribute
    h5md_grp = hdf.require_group('h5md/version/author/creator')
    h5md_grp.attrs['version'] = 1.0
    h5md_grp.attrs['author'] = 'rohit'
    h5md_grp.attrs['creator'] = 'known'
    
    # datasets with known sizes
    ds_time = particles_grp.create_dataset('time', dtype="f", shape=(0,), maxshape=(None,), compression='gzip', shuffle=True)
    ds_step = particles_grp.create_dataset('step', dtype=np.uint64, shape=(0,), maxshape=(None,), compression='gzip', shuffle=True)
    ds_value = None

    step = 0
    while True:
        header = f.readline()
        m = re.search("t= *(.*)$", header)
        if m:
            time = float(m.group(1))
        else:
            print("End Of File")
            break

        # get number of data rows, i.e., number of particles
        nparticles = int(f.readline())
        # read data lines and store in array
        arr = np.empty(shape=(nparticles, 3), dtype=np.float32)
        for row in range(nparticles):
            fields = parse( f.readline().encode('utf-8') )
            arr[row] = np.array((float(fields[3]), float(fields[4]), float(fields[5])))

        if nparticles > 0:
            # create a resizable dataset upon the first iteration
            if not ds_value:
                ds_value = particles_grp.create_dataset('value', dtype=np.float32,
                                                        shape=(0, nparticles, 3), maxshape=(None, nparticles, 3),
                                                        chunks=(1, nparticles, 3), compression='gzip', shuffle=True)
                #edge_data = bound_grp.create_dataset('box_size', dtype=np.float32, shape=(0, nparticles, 3), maxshape=(None, nparticles, 3), compression='gzip', shuffle=True)
            # append this sample to the datasets
            ds_time.resize(step + 1, axis=0)
            ds_step.resize(step + 1, axis=0)
            ds_value.resize(step + 1, axis=0)
            ds_time[step] = time
            ds_step[step] = step
            ds_value[step] = arr
  
        footer = parse1( f.readline().encode('utf-8') )
        dat = np.array(footer)
        print(dat)
        arr1 = np.empty(shape=(1, 3), dtype=np.float32)
        edge_data = bound_grp.create_dataset('box_size', data=dat, dtype=np.float32, compression='gzip', shuffle=True)
        
        step += 1
        #=============================================================================

Solution

  • Your code has a number of small errors when reading and converting the "footer" line. I modified the code and got it working, but I'm not sure it does exactly what you want. I used the same group and dataset definitions, so the footer data is written to this dataset:

    /particles/lipids/positions/box/boundary/box_size
    

    That comes from the following group and dataset definitions:

    particles_grp = hdf.require_group('particles/lipids/positions')
    box_grp = particles_grp.create_group('box')
    bound_grp = box_grp.create_group('boundary')
    edge_data = bound_grp.create_dataset('box_size'....
    

    Corrections are required in several places.
    First, you need to change the definition of parse1 to match the three fields of the footer line.

    # Format for footer
    # FROM:
    fmtstring1 = '1s 1s 5s 7s 7s 7s'
    # TO:
    fmtstring1 = '10s 10s 10s'
    
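    You can sanity-check the new format string against the footer line from your sample data (each field is 10 characters wide):

    import struct

    # Quick check, not part of the final code: parse the sample footer line
    footer_line = "  61.42836  61.42836   8.47704"
    fields = struct.Struct('10s 10s 10s').unpack_from(footer_line.encode('utf-8'))
    print([float(f) for f in fields])   # [61.42836, 61.42836, 8.47704]
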

    Next, you need to modify where and how the box_size dataset is created. You need to create it like the others: as an extensible dataset (via the maxshape=(...) parameter) ABOVE the while True: loop. This is what I did:

    edge_ds_step = edge_grp.create_dataset('step', dtype=np.uint64, shape=(0,), maxshape=(None,), compression='gzip', shuffle=True)
    # Create empty 'box_size' dataset here
    edge_data = bound_grp.create_dataset('box_size', dtype=np.float32, shape=(0,3), maxshape=(None,3), compression='gzip', shuffle=True)
    
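    For reference, here is the resize-and-append pattern on its own (a minimal standalone sketch; the file and dataset names are only illustrative):

    import h5py
    import numpy as np

    with h5py.File('demo.h5', 'w') as hdf:
        # maxshape=(None, 3) lets the first axis grow by one row per frame
        ds = hdf.create_dataset('box_size', shape=(0, 3), maxshape=(None, 3), dtype=np.float32)
        for frame in range(3):
            ds.resize(ds.shape[0] + 1, axis=0)           # grow by one row
            ds[-1, :] = [61.42836, 61.42836, 8.47704]    # write the new row
        print(ds.shape)   # (3, 3)
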

    Finally, here is the modified code to:

    1. Parse the footer string to a tuple,

    2. Map the tuple to an np.array of floats, shape=(1,3),

    3. Resize the dataset, and finally

    4. Load the array into the dataset.

      footer = parse1( f.readline().encode('utf-8') )
      dat = np.array(footer).astype(float).reshape(1,3)
      new_size = edge_data.shape[0]+1
      edge_data.resize(new_size, axis=0)
      edge_data[new_size-1:new_size,:] = dat
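
    To verify the result, you can read the file back: box_size ends up with one row per frame, so its rows align by index with the step and time datasets (a quick check, assuming the file and paths above):

      import h5py

      with h5py.File('xaa_trial.h5', 'r') as hdf:
          box_size = hdf['particles/lipids/positions/box/boundary/box_size']
          steps = hdf['particles/lipids/positions/step']
          print(box_size.shape)   # (n_frames, 3) -- one row per frame
          print(steps.shape)      # (n_frames,)   -- same leading axis, aligned by index
          print(box_size[0])      # first frame's box, e.g. [61.42836 61.42836 8.47704]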