Search code examples
pythonarrayspandasnumpyscientific-computing

How to read in file with multiple coordinates and store in separate arrays?


I have a file that has about 3000 blocks repeated as follows (first two shown):

    21
Profile.   1 HEAT OF FORMATION =   -79.392 KCAL =   -332.175 KJ
    H       -2.22728       -1.35263        1.32579
    H        1.21425       -1.35263        1.32579
    C        1.43878        0.44129        1.32579
    O        2.25748       -0.52202        1.23773
    C        0.12570       -0.10907        1.38542
    H       -0.47394        0.10034        2.26424
    C       -2.02530       -1.28825       -2.05204
    C       -0.80697       -0.63466       -2.22403
    H       -0.41632       -0.42983       -3.21532
    H        0.84731        0.28355       -1.21782
    C       -0.09866       -0.24043       -1.09182
    C       -1.83256       -1.15779        0.32994
    C       -0.59706       -0.50055        0.19091
    H       -3.51151       -2.06378       -0.69513
    C       -2.55421       -1.55647       -0.78456
    O       -2.78665       -1.71220       -3.09841
    H       -2.37922       -1.48635       -3.96745
    H        2.21062        3.22762        2.75985
    C        1.91952        1.85374        1.37731
    O        2.22890        2.54529        0.44919
    O        1.92486        2.27899        2.65936
    21
Profile.   2 HEAT OF FORMATION =   -79.390 KCAL =   -332.168 KJ
    H       -2.22728       -1.35263        1.32579
    H        1.21674       -1.35282        1.32529
    C        1.43862        0.44132        1.32582
    O        2.25745       -0.52214        1.23772
    C        0.12565       -0.10889        1.38540
    H       -0.47402        0.10051        2.26417
    C       -2.02530       -1.28825       -2.05204
    C       -0.80697       -0.63465       -2.22403
    H       -0.41632       -0.42983       -3.21531
    H        0.84730        0.28355       -1.21782
    C       -0.09865       -0.24043       -1.09182
    C       -1.83256       -1.15780        0.32995
    C       -0.59702       -0.50058        0.19094
    H       -3.51151       -2.06378       -0.69513
    C       -2.55421       -1.55647       -0.78456
    O       -2.78666       -1.71220       -3.09841
    H       -2.37922       -1.48635       -3.96745
    H        2.21061        3.22763        2.75985
    C        1.91953        1.85373        1.37732
    O        2.22890        2.54528        0.44919
    O        1.92486        2.27898        2.65936

where each set of coordinates is separated by the number of atoms (21 in this case) and a heat of formation.

I'm wondering how to go about reading in and writing each set of coordinates to a separate array so that I can ultimately manipulate certain elements of these arrays.


Solution

  • As I suggested in the comment:

    Read all the lines:

    In [783]: with open('stack40730696.txt','rb') as f:
         ...:     lines = f.readlines()
    

    I could read line by line, block by block,etc. but it's easiest to play with a list.

    Now read the 1st block. Looks like '21' is the number of data lines:

    In [784]: i=0
    In [785]: n=int(lines[i])
    In [786]: n
    Out[786]: 21
    In [787]: i+=1
    In [788]: block=lines[i:i+1+n]   # grab the lines of a block, with header
    In [789]: block[0]
    Out[789]: b'Profile.   1 HEAT OF FORMATION =   -79.392 KCAL =   -332.175 KJ\n'
    In [790]: block[-1]
    Out[790]: b'    O        1.92486        2.27899        2.65936\n'
    

    Now load into an array with genfromtxt; and check the result. I printed the whole thing, but here I'll just print a few details.

    In [791]: data1=np.genfromtxt(block, skip_header=1,dtype=None)
    In [792]: data1.shape
    Out[792]: (21,)
    In [793]: data1.dtype
    Out[793]: dtype([('f0', 'S1'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])
    

    Advance to the next block and repeat. Of course for the whole file I'd put this in a loop, and collect the data arrays in a list.

    In [794]: i=i+1+n
    In [795]: n=int(lines[i])
    In [796]: i+=1
    In [797]: block=lines[i:i+1+n]
    In [798]: data2=np.genfromtxt(block, skip_header=1,dtype=None)
    In [799]: data2.shape
    Out[799]: (21,)
    In [800]: data2.dtype
    Out[800]: dtype([('f0', 'S1'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])
    

    a few records

    In [802]: data2[:3]
    Out[802]: 
    array([(b'H', -2.22728, -1.35263, 1.32579),
           (b'H', 1.21674, -1.35282, 1.32529),
           (b'C', 1.43862, 0.44132, 1.32582)], 
          dtype=[('f0', 'S1'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])
    

    This is a structured array with one string field, and 3 float ones. This could be split into a string array and a (21,3) float array

    In [803]: dataf=np.genfromtxt(block, skip_header=1,usecols=[1,2,3])
    In [804]: dataf.shape
    Out[804]: (21, 3)
    In [805]: dataf.dtype
    Out[805]: dtype('float64')