Python (numpy) read a text file with mixed format


I have thousands of files like the one below. For the rows corresponding to the atoms ['CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ'], I want to extract the values in columns 6, 7, and 8:

ATOM      1  CG  TOLU    1      -0.437  -0.756   1.802  1.00  1.99      PRO0
ATOM      2  HG  TOLU    1      -0.689  -1.123   2.786  1.00  0.00      PRO0
ATOM      3  CD1 TOLU    1       0.041  -1.623   0.811  1.00  1.99      PRO0
ATOM      4  HD1 TOLU    1       0.331  -2.603   1.162  1.00  0.00      PRO0
ATOM      5  CD2 TOLU    1      -0.692   0.547   1.352  1.00  1.99      PRO0
ATOM      6  HD2 TOLU    1      -1.131   1.264   2.030  1.00  0.00      PRO0
ATOM      7  CE1 TOLU    1       0.246  -1.276  -0.504  1.00  1.99      PRO0
ATOM      8  HE1 TOLU    1       0.596  -2.073  -1.144  1.00  0.00      PRO0
ATOM      9  CE2 TOLU    1      -0.331   0.991   0.063  1.00  1.99      PRO0
ATOM     10  HE2 TOLU    1      -0.565   2.030  -0.117  1.00  0.00      PRO0
ATOM     11  CZ  TOLU    1       0.136   0.076  -0.919  1.00  1.99      PRO0
ATOM     12  CT  TOLU    1       0.561   0.474  -2.282  1.00  0.00      PRO0
ATOM     13  H11 TOLU    1       0.529  -0.410  -2.955  1.00  0.00      PRO0
ATOM     14  H12 TOLU    1       1.574   0.930  -2.294  1.00  0.00      PRO0
ATOM     15  H13 TOLU    1      -0.203   1.165  -2.699  1.00  0.00      PRO0
ATOM     16  CG  TOLU    2       5.140   1.762  -1.390  1.00  1.99      PRO0
ATOM     17  HG  TOLU    2       5.815   1.717  -2.231  1.00  0.00      PRO0
ATOM     18  CD1 TOLU    2       4.578   0.647  -0.862  1.00  1.99      PRO0
ATOM     19  HD1 TOLU    2       4.835  -0.329  -1.246  1.00  0.00      PRO0
ATOM     20  CD2 TOLU    2       4.786   3.044  -0.824  1.00  1.99      PRO0
ATOM     21  HD2 TOLU    2       5.184   3.982  -1.181  1.00  0.00      PRO0
ATOM     22  CE1 TOLU    2       3.734   0.667   0.248  1.00  1.99      PRO0
ATOM     23  HE1 TOLU    2       3.131  -0.167   0.574  1.00  0.00      PRO0
ATOM     24  CE2 TOLU    2       4.042   3.068   0.321  1.00  1.99      PRO0
ATOM     25  HE2 TOLU    2       3.753   3.969   0.841  1.00  0.00      PRO0
ATOM     26  CZ  TOLU    2       3.465   1.886   0.893  1.00  1.99      PRO0
ATOM     27  CT  TOLU    2       2.501   1.806   2.157  1.00  0.00      PRO0
ATOM     28  H11 TOLU    2       2.361   0.712   2.283  1.00  0.00      PRO0
ATOM     29  H12 TOLU    2       1.490   2.181   1.890  1.00  0.00      PRO0
ATOM     30  H13 TOLU    2       2.845   2.513   2.943  1.00  0.00      PRO0
TER
END

Notice that each of these atoms appears in two rows. I therefore think two dictionaries with six keys each (12 entries in total) would best fit my goal, like this:

{1: {'CG':(0,0,0), 'CD1':(0,0,0), 'CD2':(0,0,0), 'CE1':(0,0,0), 'CE2':(0,0,0), 'CZ':(0,0,0)},
2: {'CG':(0,0,0), 'CD1':(0,0,0), 'CD2':(0,0,0), 'CE1':(0,0,0), 'CE2':(0,0,0), 'CZ':(0,0,0)}}

The outer keys (1, 2) refer to the 5th column.

Can you tell me a robust way to read the file and assign each tuple of values to its correct place in the dictionary? I can do it with multiple if conditions, but I think there must be a better way (maybe with numpy).


Solution

  • This will do the job (note that the outer keys are 0 and 1 rather than 1 and 2):

    atmlist = ['CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ']

    def Read_PDB(filename):
       # Outer keys 0 and 1 correspond to the values 1 and 2 in the 5th column
       coord = {r: {k: (0, 0, 0) for k in atmlist} for r in [0, 1]}
       try:
          f = open(filename, 'r')
       except IOError as err:
          print("I/O error({0}): {1}".format(err.errno, err.strerror))
          raise SystemExit(1)

       with f:  # make sure the file is closed once it has been read
          for line in f:
             for at in atmlist:
                # The atom name starts at character index 13 in these files
                if line.find(at) == 13:
                   temp = line.split()
                   crd = (float(temp[5]), float(temp[6]), float(temp[7]))
                   coord[int(temp[4]) - 1][at] = crd

       return coord
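
A variation on the same idea, as a minimal sketch rather than a definitive implementation: it keys the outer dictionary by the value in the 5th column itself (1, 2, ...), as in the layout the question asks for, and does not fix the number of groups in advance. The names `ATOMS` and `read_ring_coords` are made up for this illustration. It assumes that the fields are always separated by whitespace, as in the sample above; in a file where neighbouring columns run together, slicing each line by character position would be needed instead.

```python
from collections import defaultdict

# Atom names whose coordinates should be kept
ATOMS = {'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ'}

def read_ring_coords(lines):
    """Return {column-5 value: {atom name: (x, y, z)}} for the atoms in ATOMS."""
    coords = defaultdict(dict)
    for line in lines:
        fields = line.split()
        # Skip TER, END and blank lines, and atoms that are not of interest
        if len(fields) < 8 or fields[0] != 'ATOM' or fields[2] not in ATOMS:
            continue
        group = int(fields[4])  # 5th column
        # Columns 6, 7 and 8 hold the x, y, z values
        coords[group][fields[2]] = tuple(float(v) for v in fields[5:8])
    return dict(coords)

# A few lines taken from the sample file in the question
sample = [
    "ATOM      1  CG  TOLU    1      -0.437  -0.756   1.802  1.00  1.99      PRO0",
    "ATOM      2  HG  TOLU    1      -0.689  -1.123   2.786  1.00  0.00      PRO0",
    "ATOM     16  CG  TOLU    2       5.140   1.762  -1.390  1.00  1.99      PRO0",
    "TER",
    "END",
]
result = read_ring_coords(sample)
print(result)
```

Because the function only iterates over lines, an open file can be passed directly, for example `with open(filename) as f: coords = read_ring_coords(f)`.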