I want to read data from a text file and write it to the hdf5 format. But somehow in the middle of the data file the space between the columns vanishes. small part of the file The data looks something like this:
Generated by trjconv : P/L=1/400 t= 0.00000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
3P2 aP2 3 18.53 -9.69 4.68
4P2 aP2 4 55.39 74.34 4.60
5P3 aP3 5 22.11 68.71 3.85
.
.
9994LI aLI 9994 24.60 41.14 5.32
9995LI aLI 9995 88.47 43.02 5.72
9996LI aLI 9996 18.98 40.60 5.56
9997LI aLI 9997 35.63 46.43 5.68
9998LI aLI 9998 33.81 52.15 5.41
9999LI aLI 9999 38.72 57.18 5.32
10000LI aLI10000 29.36 47.12 5.55
10001LI aLI10001 82.55 44.80 5.50
10002LI aLI10002 42.52 51.00 5.19
10003LI aLI10003 28.61 40.21 5.70
10004LI aLI10004 38.16 42.85 5.33
Generated by trjconv : P/L=1/400 t= 1000.00
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
3P2 aP2 3 18.53 -9.69 4.68
4P2 aP2 4 55.39 74.34 4.60
5P3 aP3 5 22.11 68.71 3.85
.
.
9994LI aLI 9994 24.60 41.14 5.32
9995LI aLI 9995 88.47 43.02 5.72
9996LI aLI 9996 18.98 40.60 5.56
9997LI aLI 9997 35.63 46.43 5.68
9998LI aLI 9998 33.81 52.15 5.41
9999LI aLI 9999 38.72 57.18 5.32
10000LI aLI10000 29.36 47.12 5.55
10001LI aLI10001 82.55 44.80 5.50
10002LI aLI10002 42.52 51.00 5.19
10003LI aLI10003 28.61 40.21 5.70
10004LI aLI10004 38.16 42.85 5.33
..
..
..
The data is a collection of frames at t=1000 and has a million frames. As you can see in the end of frame the column 2 & 3 touches each other. I want to create the space between these while reading the data. The other issue I've with the repeated headers Generated by... How can I read and write them into the hdf5 file, because h5 file does not support the strings? Are there ways to add them manually? Here is the code:
import h5py
import numpy as np
#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int),
('col4', float), ('col5', float), ('col6', float)])
# Next, create an empty .h5 file with the dtype
with h5py.File('xaa.h5', 'w') as hdf:
ds= hdf.create_dataset('dataset1', dtype=gro_dt, shape=(20,), maxshape=(None,))
# Next read line 1 of .gro file
f = open('xaa', 'r')
data = f.readlines()
ds.attrs["Source"]=data[0]
f.close()
# loop to read rows from 2 until end
skip, incr, row0 = 2, 20, 0
read_gro = True
while read_gro:
arr = np.genfromtxt('xaa', skip_header=skip, max_rows=incr, dtype=gro_dt)
rows = arr.shape[0]
if rows == 0:
read_gro = False
else:
if row0+rows > ds.shape[0] :
ds.resize((row0+rows,))
ds[row0:row0+rows] = arr
skip += rows
row0 += rows
I can skip the first header, but how to deal with the coming headers? I can provide the line number of headers if someone need. The columns throws the valueError
ValueError: Some errors were detected !
Line #7 (got 5 columns instead of 6)
Line #8 (got 5 columns instead of 6)
Line #9 (got 5 columns instead of 6)
Answer updated 2021-09-09:
Based on the request in the comments, I added 2 new methods that use f.readline()
. One slices the line with indices, the other uses the struct
package to unpack the fields. struct
is supposed to be faster, but I did not see a significant difference in performance for the test file (75 time steps).
Also, I modified the code to loop with while True:
and break at the end of the file. This avoids the need to enter the number of time steps.
This is an answer I wrote based on your problems with my answer to your previous question. (Ref: Reading data from gromacs file and write it to the hdf5 file format. This answer uses readlines()
to read the data into a list. (That might be a problem with your large file. If so, the solution can be modified to read line-by-line with readline()
.) It slices the data on each line using indices aligned to the field widths. Warning: reading 50e6 lines might take awhile. Note: HDF5 supports strings (but h5py does not support NumPy Unicode strings).
Method 1: Uses f.readlines()
and processes list.
Gets values by slicing each line with indices:
import h5py
import numpy as np
csv_file = 'xaa.txt' # data from link in question
# define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int),
('col4', float), ('col5', float), ('col6', float)])
c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
# The values above are used as indices to slice line
# into the following fields in the loop on data[]:
# [:7], [7:15], [15:20], [20:27], [27:34], [34:]
# Open the file for reading and
# create an empty .h5 file with the dtype above
with open(csv_file, 'r') as f, \
h5py.File('xaa.h5', 'w') as hdf:
data = f.readlines()
skip = 0
step = 0
while True:
# Read text header line for THIS time step
if skip == len(data):
print("End Of File")
break
else:
header = data[skip]
print(header)
skip += 1
# get number of data rows
no_rows = int(data[skip])
skip += 1
arr = np.empty(shape=(no_rows,), dtype=gro_dt)
for row, line in enumerate(data[skip:skip+no_rows]):
arr[row]['col1'] = line[:c1].strip()
arr[row]['col2'] = line[c1:c2].strip()
arr[row]['col3'] = int(line[c2:c3])
arr[row]['col4'] = float(line[c3:c4])
arr[row]['col5'] = float(line[c4:c5])
arr[row]['col6'] = float(line[c5:])
if arr.shape[0] > 0:
# create a dataset for THIS time step
ds= hdf.create_dataset(f'dataset_{step:04}', data=arr)
#create attributes for this dataset / time step
hdr_tokens = header.split()
ds.attrs['raw_header'] = header
ds.attrs['Generated by'] = hdr_tokens[2]
ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
ds.attrs['Time'] = hdr_tokens[6]
# increment by rows plus footer line that follows
skip += 1 + no_rows
Method 2: Uses f.readline()
to read line-by-line.
Gets values by slicing each line with indices:
import h5py
import numpy as np
csv_file = 'xaa.txt'
#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int),
('col4', float), ('col5', float), ('col6', float)])
## gro_fmt=[0:7], [7:15], [15:20], [20:27], [27:34], [34:41]
c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
# Open the file for reading and
# create an empty .h5 file with the dtype above
with open(csv_file, 'r') as f, \
h5py.File('xaa.h5', 'w') as hdf:
step = 0
while True:
# Read text header line for THIS time step
header = f.readline()
if not header:
print("End Of File")
break
else:
print(header)
# get number of data rows
no_rows = int(f.readline())
arr = np.empty(shape=(no_rows,), dtype=gro_dt)
for row in range(no_rows):
line = f.readline()
arr[row]['col1'] = line[:c1].strip()
arr[row]['col2'] = line[c1:c2].strip()
arr[row]['col3'] = int(line[c2:c3])
arr[row]['col4'] = float(line[c3:c4])
arr[row]['col5'] = float(line[c4:c5])
arr[row]['col6'] = float(line[c5:])
if arr.shape[0] > 0:
# create a dataset for THIS time step
ds= hdf.create_dataset(f'dataset_{step:04}', data=arr)
#create attributes for this dataset / time step
print(header)
hdr_tokens = header.split()
ds.attrs['raw_header'] = header
ds.attrs['Generated by'] = hdr_tokens[2]
ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
ds.attrs['Time'] = hdr_tokens[6]
footer = f.readline()
step += 1
Method 3: Uses f.readlines()
to read line-by-line.
Uses struct
package to unpack values from each line:
import struct
import numpy as np
import h5py
csv_file = 'xaa.txt'
fmtstring = '7s 8s 5s 7s 7s 7s'
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from
#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int),
('col4', float), ('col5', float), ('col6', float)])
with open(csv_file, 'r') as f, \
h5py.File('xaa.h5', 'w') as hdf:
step = 0
while True:
header = f.readline()
if not header:
print("End Of File")
break
else:
print(header)
# get number of data rows
no_rows = int(f.readline())
arr = np.empty(shape=(no_rows,), dtype=gro_dt)
for row in range(no_rows):
fields = parse( f.readline().encode('utf-8') )
arr[row]['col1'] = fields[0].strip()
arr[row]['col2'] = fields[1].strip()
arr[row]['col3'] = int(fields[2])
arr[row]['col4'] = float(fields[3])
arr[row]['col5'] = float(fields[4])
arr[row]['col6'] = float(fields[5])
if arr.shape[0] > 0:
# create a dataset for THIS time step
ds= hdf.create_dataset(f'dataset_{step:04}', data=arr)
#create attributes for this dataset / time step
hdr_tokens = header.split()
ds.attrs['raw_header'] = header
ds.attrs['Generated by'] = hdr_tokens[2]
ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
ds.attrs['Time'] = hdr_tokens[6]
footer = f.readline()
step += 1