I am trying to read data from a .gro file line by line and write this data to a .h5 file. But I am getting the error: TypeError: No conversion path for dtype: dtype('<U7'). I guess the data read is in string format. I tried to convert it to arrays using np.array, but it does not work. Can anyone help me resolve this, or suggest a better way to read the data? I cannot use np.loadtxt because the size of the data is around 50GB.
The format of the .gro file looks like this:
Generated by trjconv : P/L=1/400 t= 0.00000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
Generated by trjconv : P/L=1/400 t= 10.00000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
Generated by trjconv : P/L=1/400 t= 20.00000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
Generated by trjconv : P/L=1/400 t= 30.00000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
Generated by trjconv : P/L=1/400 t= 40.00000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
Error:
ValueError: Some errors were detected !
Line #5 (got 7 columns instead of 6)
Line #6 (got 1 columns instead of 6)
Line #9 (got 7 columns instead of 6)
Line #10 (got 1 columns instead of 6)
Line #13 (got 7 columns instead of 6)
Line #14 (got 1 columns instead of 6)
Line #17 (got 7 columns instead of 6)
Line #18 (got 1 columns instead of 6)
Here is my small code:
import h5py
import numpy as np

# First step is to read .gro file
f = open('pep.gro', 'r')
data = f.readlines()
for line in data:
    reading = line.split()
    #print(type(reading))
    #dat = np.array(reading).astype(int)

# Next step is to write the data to .h5 file
with h5py.File('pep1.h5', 'w') as hdf:
    hdf.create_dataset('dataset1', data=reading)
You start by creating the HDF5 dataset with a large number of rows [shape=(1_000_000,)] and use the maxshape parameter to make it extensible. A value of maxshape=(None,) will allow unlimited rows. I defined a simple dtype to match your data. Creating a matching dtype for different file formats can be automated if needed.
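If it helps to see the resizing behaviour in isolation, here is a minimal sketch (the file and dataset names are placeholders, not from your script):

import h5py
import numpy as np

# Minimal sketch of an extensible dataset: maxshape=(None,) lets the first axis grow.
with h5py.File('resize_demo.h5', 'w') as hdf:   # placeholder file name
    ds = hdf.create_dataset('demo', shape=(10,), maxshape=(None,), dtype='f8')
    ds[0:10] = np.arange(10.0)            # fill the initial allocation
    ds.resize((20,))                      # grow the dataset in place
    ds[10:20] = np.arange(10.0, 20.0)     # write the next chunk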
You got the Unicode error because h5py doesn't support Unicode string data. (NumPy creates Unicode strings from Python strings by default.) The workaround for this limitation is to define a dtype for the array in advance (using 'S#' byte strings where NumPy uses '<U#'). You will use this dtype when you create the dataset AND when you read the data (see below).
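To see the Unicode issue concretely, here is a small sketch (the file name is made up for illustration):

import h5py
import numpy as np

tokens = ['1P1', 'aP1', '1', '80.48', '35.36', '4.25']

u_arr = np.array(tokens)               # NumPy picks '<U5' (Unicode), which h5py rejects
b_arr = np.array(tokens, dtype='S6')   # byte strings ('S#') are accepted by h5py

with h5py.File('dtype_demo.h5', 'w') as hdf:    # placeholder file name
    hdf.create_dataset('ok', data=b_arr)        # works
    # hdf.create_dataset('bad', data=u_arr)     # raises: no conversion path for dtype('<U5')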
Next use np.genfromtxt to read your data directly into NumPy arrays. Use the skip_header and max_rows parameters to read incrementally. Include the dtype parameter with the dtype used to create the dataset described above.
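For example, reading just the two data rows of the first block in your sample would look like this:

import numpy as np

gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int),
                   ('col4', float), ('col5', float), ('col6', float)])

# Skip the 2 header lines, then read the next 2 lines into a structured array.
arr = np.genfromtxt('pep.gro', skip_header=2, max_rows=2, dtype=gro_dt)
print(arr['col4'])   # -> [80.48 37.45]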
To test the incremental read, I extended your file to 54 lines (for 3 read loops). For performance reasons, you will want to use much larger values to read 50GB (set incr to what you can read into memory -- start with 100_000 lines).
Code below (modified to skip the first two lines):
import h5py
import numpy as np

# define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int),
                   ('col4', float), ('col5', float), ('col6', float)])

# Next, create an empty .h5 file with the dtype
with h5py.File('pep1.h5', 'w') as hdf:
    ds = hdf.create_dataset('dataset1', dtype=gro_dt, shape=(20,), maxshape=(None,))

    # Next read line 1 of .gro file
    f = open('pep.gro', 'r')
    data = f.readlines()
    ds.attrs["Source"] = data[0]
    f.close()

    # loop to read rows from 2 until end
    skip, incr, row0 = 2, 20, 0
    read_gro = True
    while read_gro:
        arr = np.genfromtxt('pep.gro', skip_header=skip, max_rows=incr, dtype=gro_dt)
        rows = arr.shape[0]
        if rows == 0:
            read_gro = False
        else:
            if row0 + rows > ds.shape[0]:
                ds.resize((row0 + rows,))
            ds[row0:row0 + rows] = arr
            skip += rows
            row0 += rows
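Once the file is written, you can spot-check the result like this (a quick verification sketch; note that the string columns come back as byte strings, so use .decode() if you want Python str):

import h5py

with h5py.File('pep1.h5', 'r') as hdf:
    ds = hdf['dataset1']
    print(ds.attrs["Source"])         # the header line saved earlier
    first = ds[0]                     # one structured record
    print(first)
    print(first['col2'].decode())     # byte string -> Python str, e.g. 'aP1'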