Tags: python, hdf5, h5py

Reading data from a GROMACS file and writing it to the HDF5 file format


I am trying to read data from a .gro file line by line and write this data to the .h5 file format, but I get TypeError: "No conversion path for dtype: dtype('<U7')". I guess the data being read is in string format. I tried to convert it to arrays using np.array, but that does not work. Can anyone help me resolve this problem? Or is there a better way to read the data? I cannot use np.loadtxt because the size of the data is around 50 GB.

The format of the .gro file looks like this:

Generated by trjconv : P/L=1/400 t=   0.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
Generated by trjconv : P/L=1/400 t=   10.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
Generated by trjconv : P/L=1/400 t=   20.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
Generated by trjconv : P/L=1/400 t=   30.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
Generated by trjconv : P/L=1/400 t=   40.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96

Error:

ValueError: Some errors were detected !
    Line #5 (got 7 columns instead of 6)
    Line #6 (got 1 columns instead of 6)
    Line #9 (got 7 columns instead of 6)
    Line #10 (got 1 columns instead of 6)
    Line #13 (got 7 columns instead of 6)
    Line #14 (got 1 columns instead of 6)
    Line #17 (got 7 columns instead of 6)
    Line #18 (got 1 columns instead of 6)

Here is my small code:

import h5py
import numpy as np
# First step is to read .gro file
f = open('pep.gro', 'r')
data = f.readlines()
for line in data:
    reading = line.split()
    #print(type(reading))
    #dat = np.array(reading).astype(int)

# Next step is to write the data to .h5 file
with h5py.File('pep1.h5', 'w') as hdf:
    hdf.create_dataset('dataset1', data=reading)

Solution

  • You start by creating the HDF5 dataset with a large number of rows, e.g. shape=(1_000_000,), and use the maxshape parameter to make it extensible. A value of maxshape=(None,) allows unlimited rows. I defined a simple dtype to match your data; creating a matching dtype for different file formats can be automated if needed.

    You got the Unicode error because h5py doesn't support NumPy Unicode string data. (NumPy creates Unicode ('<U#') data from strings by default.) The workaround for this limitation is to define a byte-string dtype for the array in advance (using 'S#' where NumPy would use '<U#'). You will use this dtype when you create the dataset AND when you read the data (see below; a short standalone demo of the Unicode issue also follows the main code).

    Next, use np.genfromtxt to read your data directly into NumPy arrays. Use the skip_header and max_rows parameters to read incrementally. Include the dtype parameter with the dtype used to create the dataset described above.

    To test the incremental read, I extended your file to 54 lines (for 3 read loops). For performance reasons, you will want to use much larger values to read 50GB (set incr to what you can read into memory -- start with 100_000 lines).

    Code below (modified to skip the first two header lines):

    import h5py
    import numpy as np

    # Define a np.dtype for the gro array/dataset (hard-coded for now)
    gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int),
                       ('col4', float), ('col5', float), ('col6', float)])

    # Next, create an empty .h5 file with an extensible dataset of that dtype
    with h5py.File('pep1.h5', 'w') as hdf:
        ds = hdf.create_dataset('dataset1', dtype=gro_dt, shape=(20,), maxshape=(None,))

        # Read line 1 of the .gro file and save it as an attribute
        # (read only the first line; readlines() would load the whole 50 GB file)
        with open('pep.gro', 'r') as f:
            ds.attrs["Source"] = f.readline()

        # Loop to read data rows (line 3 onward) until the end of the file
        skip, incr, row0 = 2, 20, 0
        read_gro = True
        while read_gro:
            arr = np.genfromtxt('pep.gro', skip_header=skip, max_rows=incr, dtype=gro_dt)
            arr = np.atleast_1d(arr)   # genfromtxt returns a 0-d array for a single row
            rows = arr.shape[0]
            if rows == 0:
                read_gro = False
            else:
                # Grow the dataset when the next chunk won't fit, then write it
                if row0 + rows > ds.shape[0]:
                    ds.resize((row0 + rows,))
                ds[row0:row0+rows] = arr
                skip += rows
                row0 += rows
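
    To make the original TypeError concrete, here is a minimal sketch of the Unicode issue (my addition; the file name demo.h5 is hypothetical). NumPy builds '<U#' Unicode arrays from Python strings, which h5py cannot convert, while fixed-width byte strings ('S#') store fine.

    import h5py
    import numpy as np

    tokens = "1P1 aP1 1 80.48 35.36 4.25".split()
    print(np.array(tokens).dtype)        # <U5 -- this is what h5py rejects

    with h5py.File('demo.h5', 'w') as hf:
        # Works: fixed-width byte strings
        hf.create_dataset('as_bytes', data=np.array(tokens, dtype='S5'))
        # Fails with "TypeError: No conversion path for dtype: dtype('<U5')"
        # hf.create_dataset('as_unicode', data=np.array(tokens))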
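
    Once the file is written, here is a minimal read-back sketch (assuming the dataset and attribute names used above): the string columns come back as byte strings, so decode them if you need Python str values.

    import h5py

    with h5py.File('pep1.h5', 'r') as hdf:
        ds = hdf['dataset1']
        print(ds.attrs['Source'])        # header line saved earlier
        arr = ds[:5]                     # first 5 rows as a structured array
        for row in arr:
            name = row['col1'].decode()  # b'1P1' -> '1P1'
            print(name, row['col4'], row['col5'], row['col6'])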