Search code examples
pythonhdf5h5py

How to correct the spaces between the columns while reading a text file?


I want to read data from a text file and write it to the hdf5 format. But somehow in the middle of the data file the space between the columns vanishes. small part of the file The data looks something like this:

Generated by trjconv : P/L=1/400 t=   0.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
    3P2     aP2    3  18.53  -9.69   4.68
    4P2     aP2    4  55.39  74.34   4.60
    5P3     aP3    5  22.11  68.71   3.85
.
.
 9994LI     aLI 9994  24.60  41.14   5.32
 9995LI     aLI 9995  88.47  43.02   5.72
 9996LI     aLI 9996  18.98  40.60   5.56
 9997LI     aLI 9997  35.63  46.43   5.68
 9998LI     aLI 9998  33.81  52.15   5.41
 9999LI     aLI 9999  38.72  57.18   5.32
10000LI     aLI10000  29.36  47.12   5.55
10001LI     aLI10001  82.55  44.80   5.50
10002LI     aLI10002  42.52  51.00   5.19
10003LI     aLI10003  28.61  40.21   5.70
10004LI     aLI10004  38.16  42.85   5.33
Generated by trjconv : P/L=1/400 t=   1000.00
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
    3P2     aP2    3  18.53  -9.69   4.68
    4P2     aP2    4  55.39  74.34   4.60
    5P3     aP3    5  22.11  68.71   3.85
.
.
 9994LI     aLI 9994  24.60  41.14   5.32
 9995LI     aLI 9995  88.47  43.02   5.72
 9996LI     aLI 9996  18.98  40.60   5.56
 9997LI     aLI 9997  35.63  46.43   5.68
 9998LI     aLI 9998  33.81  52.15   5.41
 9999LI     aLI 9999  38.72  57.18   5.32
10000LI     aLI10000  29.36  47.12   5.55
10001LI     aLI10001  82.55  44.80   5.50
10002LI     aLI10002  42.52  51.00   5.19
10003LI     aLI10003  28.61  40.21   5.70
10004LI     aLI10004  38.16  42.85   5.33
..
..
..

The data is a collection of frames at t=1000 and has a million frames. As you can see in the end of frame the column 2 & 3 touches each other. I want to create the space between these while reading the data. The other issue I've with the repeated headers Generated by... How can I read and write them into the hdf5 file, because h5 file does not support the strings? Are there ways to add them manually? Here is the code:

import h5py
import numpy as np

#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])

# Next, create an empty .h5 file with the dtype
with h5py.File('xaa.h5', 'w') as hdf:
    ds= hdf.create_dataset('dataset1', dtype=gro_dt, shape=(20,), maxshape=(None,)) 

    # Next read line 1 of .gro file
    f = open('xaa', 'r')
    data = f.readlines()
    ds.attrs["Source"]=data[0]
    f.close()

    # loop to read rows from 2 until end
    skip, incr, row0 = 2, 20, 0 
    read_gro = True
    while read_gro:
        arr = np.genfromtxt('xaa', skip_header=skip, max_rows=incr, dtype=gro_dt)
        rows = arr.shape[0]
        if rows == 0:
            read_gro = False 
        else:    
            if row0+rows > ds.shape[0] :
                ds.resize((row0+rows,))
            ds[row0:row0+rows] = arr
            skip += rows
            row0 += rows

I can skip the first header, but how to deal with the coming headers? I can provide the line number of headers if someone need. The columns throws the valueError

 ValueError: Some errors were detected !
    Line #7 (got 5 columns instead of 6)
    Line #8 (got 5 columns instead of 6)
    Line #9 (got 5 columns instead of 6)

Solution

  • Answer updated 2021-09-09:
    Based on the request in the comments, I added 2 new methods that use f.readline(). One slices the line with indices, the other uses the struct package to unpack the fields. struct is supposed to be faster, but I did not see a significant difference in performance for the test file (75 time steps).

    Also, I modified the code to loop with while True: and break at the end of the file. This avoids the need to enter the number of time steps.

    This is an answer I wrote based on your problems with my answer to your previous question. (Ref: Reading data from gromacs file and write it to the hdf5 file format. This answer uses readlines() to read the data into a list. (That might be a problem with your large file. If so, the solution can be modified to read line-by-line with readline().) It slices the data on each line using indices aligned to the field widths. Warning: reading 50e6 lines might take awhile. Note: HDF5 supports strings (but h5py does not support NumPy Unicode strings).

    Method 1: Uses f.readlines() and processes list.
    Gets values by slicing each line with indices:

    import h5py
    import numpy as np
    
    csv_file = 'xaa.txt' # data from link in question
    
    # define a np.dtype for gro array/dataset (hard-coded for now)
    gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int), 
                       ('col4', float), ('col5', float), ('col6', float)])
     
    c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
    # The values above are used as indices to slice line 
    # into the following fields in the loop on data[]: 
    # [:7], [7:15], [15:20], [20:27], [27:34], [34:]
    
    # Open the file for reading and
    # create an empty .h5 file with the dtype above   
    with open(csv_file, 'r') as f, \
         h5py.File('xaa.h5', 'w') as hdf:
    
        data = f.readlines()
        skip = 0
        step = 0
        while True:
            # Read text header line for THIS time step
            if skip == len(data):
                print("End Of File")
                break
            else:
                header = data[skip]
                print(header)
                skip += 1
            
            # get number of data rows
            no_rows = int(data[skip]) 
            skip += 1
            arr = np.empty(shape=(no_rows,), dtype=gro_dt)
            for row, line in enumerate(data[skip:skip+no_rows]):
                arr[row]['col1'] = line[:c1].strip()            
                arr[row]['col2'] = line[c1:c2].strip()            
                arr[row]['col3'] = int(line[c2:c3])
                arr[row]['col4'] = float(line[c3:c4])
                arr[row]['col5'] = float(line[c4:c5])
                arr[row]['col6'] = float(line[c5:])
                
            if arr.shape[0] > 0:
                # create a dataset for THIS time step
                ds= hdf.create_dataset(f'dataset_{step:04}', data=arr) 
                #create attributes for this dataset / time step
                hdr_tokens = header.split()
                ds.attrs['raw_header'] = header
                ds.attrs['Generated by'] = hdr_tokens[2]
                ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
                ds.attrs['Time'] = hdr_tokens[6]
                
            # increment by rows plus footer line that follows
            skip += 1 + no_rows
    

    Method 2: Uses f.readline() to read line-by-line.
    Gets values by slicing each line with indices:

    import h5py
    import numpy as np
    
    csv_file = 'xaa.txt'
    
    #define a np.dtype for gro array/dataset (hard-coded for now)
    gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int), 
                       ('col4', float), ('col5', float), ('col6', float)])
    
    ## gro_fmt=[0:7], [7:15], [15:20], [20:27], [27:34], [34:41]
    c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
    # Open the file for reading and
    # create an empty .h5 file with the dtype above   
    with open(csv_file, 'r') as f, \
         h5py.File('xaa.h5', 'w') as hdf:
    
        step = 0
        while True:
        # Read text header line for THIS time step
            header = f.readline()
            if not header:
                print("End Of File")
                break
            else:
                print(header)
    
            # get number of data rows
            no_rows = int(f.readline()) 
            arr = np.empty(shape=(no_rows,), dtype=gro_dt)
            for row in range(no_rows):
                line = f.readline()
                arr[row]['col1'] = line[:c1].strip()            
                arr[row]['col2'] = line[c1:c2].strip()            
                arr[row]['col3'] = int(line[c2:c3])
                arr[row]['col4'] = float(line[c3:c4])
                arr[row]['col5'] = float(line[c4:c5])
                arr[row]['col6'] = float(line[c5:])
                
            if arr.shape[0] > 0:
                # create a dataset for THIS time step
                ds= hdf.create_dataset(f'dataset_{step:04}', data=arr) 
                #create attributes for this dataset / time step
                print(header)
                hdr_tokens = header.split()
                ds.attrs['raw_header'] = header
                ds.attrs['Generated by'] = hdr_tokens[2]
                ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
                ds.attrs['Time'] = hdr_tokens[6]
                
            footer = f.readline()
            step += 1
    

    Method 3: Uses f.readlines() to read line-by-line.
    Uses struct package to unpack values from each line:

    import struct
    import numpy as np
    import h5py
    
    csv_file = 'xaa.txt'
    
    fmtstring = '7s 8s 5s 7s 7s 7s'
    fieldstruct = struct.Struct(fmtstring)
    parse = fieldstruct.unpack_from
    
    #define a np.dtype for gro array/dataset (hard-coded for now)
    gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int), 
                       ('col4', float), ('col5', float), ('col6', float)])
    
    with open(csv_file, 'r') as f, \
         h5py.File('xaa.h5', 'w') as hdf:
             
        step = 0
        while True:         
            header = f.readline()
            if not header:
                print("End Of File")
                break
            else:
                print(header)
    
            # get number of data rows
            no_rows = int(f.readline())
            arr = np.empty(shape=(no_rows,), dtype=gro_dt)
            for row in range(no_rows):
                fields = parse( f.readline().encode('utf-8') )
                arr[row]['col1'] = fields[0].strip()            
                arr[row]['col2'] = fields[1].strip()            
                arr[row]['col3'] = int(fields[2])
                arr[row]['col4'] = float(fields[3])
                arr[row]['col5'] = float(fields[4])
                arr[row]['col6'] = float(fields[5])
            if arr.shape[0] > 0:
                # create a dataset for THIS time step
                ds= hdf.create_dataset(f'dataset_{step:04}', data=arr) 
                #create attributes for this dataset / time step
                hdr_tokens = header.split()
                ds.attrs['raw_header'] = header
                ds.attrs['Generated by'] = hdr_tokens[2]
                ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
                ds.attrs['Time'] = hdr_tokens[6]
                
            footer = f.readline()
            step += 1