Parsing a .pdb file in Python and creating a dictionary for specific record types


First, let me start by saying that I'm doing this as a Python exercise and I'm not allowed to use Biopython.

I am writing a script that will help me parse any .pdb file generated from a trajectory. I am trying to create a dictionary that would link the chain variable with the resNumber variable. Although I solved the issue for a specific .pdb file, which only has 2 chains, I would like to make this script work for any .pdb file, no matter the number of chains. Here is what I wrote:

import sys

pdbTraj = open('md20_aligned_3frames.pdb', 'r')
pdbTraj_line = pdbTraj.readlines()
newFile = open('newfile.txt', 'w')
pdbDict = {}
resNumberList1 = []
resNumberList2 = []
chainTry = "A"
for line in pdbTraj_line:
    if line.startswith(("ATOM", "HETATM")):
        atomType = line[0:6]
        atomSerialNumber = line[6:11]
        atomName = line[12:16]
        resName = line[17:20]
        chain = line[21]
        resNumber = line[22:26]
        coorX = line[30:38]
        coorY = line[38:46]
        coorZ = line[46:54]
        occupancy = line[54:60]
        temperatureFact = line[60:66]
        segmentIdentifier = line[72:76]
        elementSymbol = line[76:78]
        if chain == chainTry:
            resNumberList1.append(resNumber)
            pdbDict[chain] = list(dict.fromkeys(resNumberList1))
        else:
            resNumberList2.append(resNumber)
            pdbDict[chain] = list(dict.fromkeys(resNumberList2))

print(pdbDict)

This is the result I get:

{'A': ['   1', '   2', '   3', '   4', '   5', '   6', '   7', '   8', '   9', '  10', '  11', '  12', '  13', '  14', '  15', '  16', '  17'], 'B': ['  19', '  20', '  21', '  22', '  23', '  24', '  25', '  26', '  27', '  28', '  29', '  30', '  31', '  32', '  33', '  34', '  35', '  36', '  37', '  38', '  39', '  40', '  41', '  42', '  43', '  44', '  45', '  46', '  47', '  48', '  49', '  50', '  51', '  52', '  53', '  54', '  55', '  56', '  57', '  58', '  59', '  60', '  61', '  62', '  63', '  64', '  65', '  66', '  67', '  68', '  69', '  70', '  71', '  72', '  73', '  74', '  75', '  76', '  77', '  78', '  79', '  80', '  81', '  82', '  83', '  84', '  85', '  86', '  87', '  88', '  89', '  90', '  91', '  92', '  93', '  94', '  95', '  96', '  97', '  98', '  99', ' 100', ' 101', ' 102', ' 103', ' 104', ' 105', ' 106', ' 107', ' 108', ' 109', ' 110', ' 111', ' 112', ' 113', ' 114', ' 115', ' 116', ' 117', ' 118', ' 119', ' 120', ' 121', ' 122', ' 123', ' 124', ' 125', ' 126', ' 127', ' 128', ' 129', ' 130', ' 131', ' 132', ' 133', ' 134', ' 135', ' 136', ' 137', ' 138', ' 139', ' 140', ' 141', ' 142', ' 143', ' 144', ' 145', ' 146', ' 147', ' 148', ' 149', ' 150', ' 151', ' 152', ' 153', ' 154', ' 155', ' 156', ' 157', ' 158', ' 159', ' 160', ' 161', ' 162', ' 163', ' 164', ' 165', ' 166', ' 167', ' 168', ' 169', ' 170', ' 171', ' 172', ' 173', ' 174', ' 175', ' 176', ' 177', ' 178', ' 179', ' 180', ' 181', ' 182', ' 183', ' 184', ' 185', ' 186', ' 187', ' 188', ' 189', ' 190', ' 191', ' 192', ' 193', ' 194', ' 195', ' 196', ' 197', ' 198', ' 199', ' 200', ' 201', ' 202', ' 203', ' 204', ' 205', ' 206', ' 207', ' 208', ' 209', ' 210', ' 211', ' 212', ' 213', ' 214', ' 215', ' 216', ' 217', ' 218', ' 219', ' 220', ' 221', ' 222', ' 223', ' 224', ' 225', ' 226', ' 227', ' 228', ' 229', ' 230', ' 231', ' 232', ' 233', ' 234', ' 235', ' 236', ' 237', ' 238', ' 239', ' 240', ' 241', ' 242', ' 243', ' 244', ' 245', ' 246', ' 247', ' 248', ' 249', ' 250', ' 251', ' 252', ' 253', ' 254', ' 255', ' 256', ' 257', ' 258', ' 259', ' 260', ' 261', ' 262', ' 263', ' 264', ' 265', ' 266', ' 267', ' 268', ' 269', ' 270', ' 271', ' 272', ' 273', ' 274', ' 275', ' 276', ' 277', ' 278', ' 279', ' 280', ' 281', ' 282', ' 283', ' 284', ' 285', ' 286', ' 287', ' 288', ' 289', ' 290', ' 291', ' 292', ' 293', ' 294', ' 295', ' 296', ' 297', ' 298', ' 299', ' 300', ' 301', ' 302', ' 303', ' 304', ' 305', ' 306', ' 307', ' 308', ' 309', ' 310', ' 311', ' 312', ' 313', ' 314', ' 315', ' 316', ' 317', ' 318', ' 319', ' 320', ' 321', ' 322', ' 323', ' 324', ' 325', ' 326', ' 327', ' 328', ' 329', ' 330', ' 331', ' 332', ' 333', ' 334', ' 335', ' 336', ' 337', ' 338', ' 339', ' 340', ' 341', ' 342', ' 343', ' 344', ' 345', ' 346', ' 347', ' 348', ' 349', ' 350', ' 351', ' 352', ' 353', ' 354', ' 355', ' 356', ' 357', ' 358', ' 359', ' 360', ' 361', ' 362', ' 363', ' 364', ' 365', ' 366', ' 367', ' 368', ' 369', ' 370', ' 371']}

So, 2 keys (chain A and chain B) and 2 lists (resNumber for chain A and resNumber for chain B).

Could you help me generalize this script for any .pdb file? Thank you!

The first few lines of the .pdb file format look like this:

CRYST1   91.372  118.560   70.786  90.00  90.00  90.00 P 1           1
ATOM      1  N   LYS A   1      10.246  29.908   8.932  0.00  0.00      A     
ATOM      2  HT1 LYS A   1      11.053  29.331   8.619  0.00  0.00      A     
ATOM      3  HT2 LYS A   1      10.405  30.386   9.842  0.00  0.00      A     
ATOM      4  HT3 LYS A   1      10.211  30.643   8.197  0.00  0.00      A     
ATOM      5  CA  LYS A   1       9.010  29.017   8.844  0.00  0.00      A     
ATOM      6  HA  LYS A   1       9.395  28.160   8.311  0.00  0.00      A     
ATOM      7  CB  LYS A   1       8.484  28.723  10.313  0.00  0.00      A     
ATOM      8  HB1 LYS A   1       9.376  28.807  10.970  0.00  0.00      A     
ATOM      9  HB2 LYS A   1       7.797  29.544  10.609  0.00  0.00      A     
ATOM     10  CG  LYS A   1       7.855  27.321  10.494  0.00  0.00      A     
ATOM     11  HG1 LYS A   1       7.016  27.501  11.199  0.00  0.00      A     
ATOM     12  HG2 LYS A   1       7.294  26.942   9.613  0.00  0.00      A     
ATOM     13  CD  LYS A   1       8.769  26.282  10.991  0.00  0.00      A     
ATOM     14  HD1 LYS A   1       9.376  26.065  10.088  0.00  0.00      A     
ATOM     15  HD2 LYS A   1       9.476  26.682  11.750  0.00  0.00      A     
ATOM     16  CE  LYS A   1       7.894  25.110  11.592  0.00  0.00      A     
ATOM     17  HE1 LYS A   1       7.347  25.505  12.475  0.00  0.00      A    

And further down, so you can also see chain B:

ATOM   3802  N   TYR B 240      -9.050 -41.325  16.074  0.00  0.00      B     
ATOM   3803  HN  TYR B 240      -8.672 -40.404  16.021  0.00  0.00      B     
ATOM   3804  CA  TYR B 240     -10.166 -41.491  15.204  0.00  0.00      B     
ATOM   3805  HA  TYR B 240      -9.685 -41.605  14.243  0.00  0.00      B     
ATOM   3806  CB  TYR B 240     -10.940 -42.818  15.365  0.00  0.00      B     
ATOM   3807  HB1 TYR B 240     -10.241 -43.631  15.078  0.00  0.00      B     
ATOM   3808  HB2 TYR B 240     -11.241 -43.061  16.407  0.00  0.00      B     
ATOM   3809  CG  TYR B 240     -12.233 -42.972  14.454  0.00  0.00      B     
ATOM   3810  CD1 TYR B 240     -12.102 -43.272  13.086  0.00  0.00      B     
ATOM   3811  HD1 TYR B 240     -11.100 -43.348  12.692  0.00  0.00      B     
ATOM   3812  CE1 TYR B 240     -13.248 -43.404  12.343  0.00  0.00      B     
ATOM   3813  HE1 TYR B 240     -13.093 -43.818  11.358  0.00  0.00      B     

If you need more information about the .pdb file format, here is a link.


Solution

  • My solution is the following:

    # Create an empty dictionary
    pdb_dict = {}

    # 1. Read the file into a list of token lists, one per line.
    #    list() is needed in Python 3, where filter() returns a lazy iterator.
    with open('test.pdb', 'r') as pdb_file:
        all_lines = [list(filter(lambda x: x != '', line.strip('\n').split(' '))) for line in pdb_file]

    # Keep only ATOM and HETATM records (this also skips blank lines)
    atom_lines = [line for line in all_lines if line and line[0] in ('ATOM', 'HETATM')]

    # 2. Use a set comprehension to identify all unique chains in the file
    #    (this assumes that no two different chains share a name, and that the
    #    last column repeats the chain ID, as in the example file)
    chains = {line[-1] for line in atom_lines}

    # 3. Create a dictionary key for each chain and collect column 1 for it
    #    (column 1 is the atom serial number; use line[5] for the residue number)
    for chain in chains:
        pdb_dict[chain] = [line[1] for line in atom_lines if line[-1] == chain]
    

    This approach consists of three steps:

    First, you read your file into a list of lists. Because the values in your file are separated by spaces, you can split each line returned by open('test.pdb', 'r').readlines(). The number of spaces between values varies, so the split leaves empty strings ('') in the list. The lambda function filters these empty strings out of each list (one list per line of the file). You can then access each value by its index, which corresponds to the column in the pdb file (counting from column 0).
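To make the splitting step concrete, here is a small sketch on a single record copied from the example file in the question. The variable names are only illustrative. It also shows that str.split() with no argument collapses runs of whitespace by itself, which gives the same tokens as the split-and-filter combination.

```python
# One ATOM record copied from the example file in the question.
record = "ATOM      5  CA  LYS A   1       9.010  29.017   8.844  0.00  0.00      A"

# Splitting on a single space leaves an empty string for every extra space.
raw_tokens = record.split(' ')

# Filtering drops the empty strings; list() makes the result indexable in Python 3.
tokens = list(filter(lambda x: x != '', raw_tokens))

# str.split() with no argument treats any run of whitespace as one separator.
same_tokens = record.split()

print(tokens)
print(tokens == same_tokens)
```

Keep in mind that token positions only match the pdb columns as long as no field is blank and no two fields touch each other.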

    Secondly, you iterate through your previously created list of lines and identify all unique chains in the file. This is what the set comprehension does.

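    Finally, you iterate through your set of chains. For each chain, you create a key in the dictionary and collect the column 1 values of all lines that belong to this chain. Note that column 1 holds the atom serial number; with this whitespace-split layout the residue number sits at index 5, so use line[5] if you want residue numbers instead.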
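As a possible alternative to the three steps above, here is a minimal single-pass sketch that reads the chain ID and residue number from the fixed column positions used in the question (line[21] and line[22:26]) and works for any number of chains. The function name residues_by_chain is made up for this illustration, and the sketch assumes that the residue number field always holds a plain integer.

```python
def residues_by_chain(lines):
    """Map each chain ID to its residue numbers, in order of first appearance."""
    seen = {}
    for line in lines:
        # Only coordinate records carry a chain ID and a residue number.
        if not line.startswith(("ATOM", "HETATM")):
            continue
        chain = line[21]               # same slice as in the question
        res_number = int(line[22:26])  # assumes a plain integer in this field
        # The inner dict acts as an ordered set (Python 3.7+), so repeated
        # atoms and repeated trajectory frames add each residue only once.
        seen.setdefault(chain, {})[res_number] = None
    return {chain: list(residues) for chain, residues in seen.items()}


# A few records copied from the example file in the question.
example = [
    "CRYST1   91.372  118.560   70.786  90.00  90.00  90.00 P 1           1",
    "ATOM      1  N   LYS A   1      10.246  29.908   8.932  0.00  0.00      A",
    "ATOM      5  CA  LYS A   1       9.010  29.017   8.844  0.00  0.00      A",
    "ATOM   3802  N   TYR B 240      -9.050 -41.325  16.074  0.00  0.00      B",
    "ATOM   3804  CA  TYR B 240     -10.166 -41.491  15.204  0.00  0.00      B",
]
print(residues_by_chain(example))  # {'A': [1], 'B': [240]}
```

Since the function only iterates over lines, you could also pass it an open file object instead of a list.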

    You can easily parse the other values from the all_lines list we created earlier, based on each value's index in the list, for example:

    for line in all_lines:
        atomType=line[0]
        atomSerialNumber=line[1]
        atomName=line[2]
        # ... and so on for the remaining columns
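Index-based access on whitespace-split tokens shifts whenever a field is blank or two fields touch (for example, two wide negative coordinates written without a space between them). A more robust variant, sketched below, slices each record at the fixed column positions already listed in the question. The function name parse_atom_record and the dictionary keys are assumptions made for this example.

```python
def parse_atom_record(line):
    """Split one ATOM/HETATM line into named fields by fixed column position."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# One record copied from the example file in the question.
atom = parse_atom_record(
    "ATOM   3802  N   TYR B 240      -9.050 -41.325  16.074  0.00  0.00      B"
)
print(atom["chain"], atom["res_number"], atom["x"], atom["y"], atom["z"])
```

The int() and float() conversions assume well-formed numeric fields; a line with unusual numbering would need extra handling.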
    

    I used the following example file:

    CRYST1   91.372  118.560   70.786  90.00  90.00  90.00 P 1           1
    ATOM      1  N   LYS A   1      10.246  29.908   8.932  0.00  0.00      A
    ATOM      2  HT1 LYS A   1      11.053  29.331   8.619  0.00  0.00      A
    ATOM      3  HT2 LYS A   1      10.405  30.386   9.842  0.00  0.00      A
    ATOM      4  HT3 LYS A   1      10.211  30.643   8.197  0.00  0.00      A
    ATOM      5  CA  LYS A   1       9.010  29.017   8.844  0.00  0.00      A
    ATOM      6  HA  LYS A   1       9.395  28.160   8.311  0.00  0.00      A
    ATOM      7  CB  LYS A   1       8.484  28.723  10.313  0.00  0.00      A
    ATOM      8  HB1 LYS A   1       9.376  28.807  10.970  0.00  0.00      A
    ATOM      9  HB2 LYS A   1       7.797  29.544  10.609  0.00  0.00      A
    ATOM     10  CG  LYS A   1       7.855  27.321  10.494  0.00  0.00      A
    ATOM     11  HG1 LYS A   1       7.016  27.501  11.199  0.00  0.00      A
    ATOM     12  HG2 LYS A   1       7.294  26.942   9.613  0.00  0.00      A
    ATOM     13  CD  LYS A   1       8.769  26.282  10.991  0.00  0.00      A
    ATOM     14  HD1 LYS A   1       9.376  26.065  10.088  0.00  0.00      A
    ATOM     15  HD2 LYS A   1       9.476  26.682  11.750  0.00  0.00      A
    ATOM     16  CE  LYS A   1       7.894  25.110  11.592  0.00  0.00      A
    ATOM     17  HE1 LYS A   1       7.347  25.505  12.475  0.00  0.00      A
    ATOM   1800  N   TYR B 240      -9.050 -41.325  16.074  0.00  0.00      B
    ATOM   1802  HN  TYR B 240      -8.672 -40.404  16.021  0.00  0.00      B
    ATOM   1803  CA  TYR B 240     -10.166 -41.491  15.204  0.00  0.00      B
    ATOM   1804  HA  TYR B 240      -9.685 -41.605  14.243  0.00  0.00      B
    ATOM   1805  CB  TYR B 240     -10.940 -42.818  15.365  0.00  0.00      B
    ATOM   1806  HB1 TYR B 240     -10.241 -43.631  15.078  0.00  0.00      B
    ATOM   1807  HB2 TYR B 240     -11.241 -43.061  16.407  0.00  0.00      B
    ATOM   1808  CG  TYR B 240     -12.233 -42.972  14.454  0.00  0.00      B
    ATOM   1810  CD1 TYR B 240     -12.102 -43.272  13.086  0.00  0.00      B
    ATOM   1811  HD1 TYR B 240     -11.100 -43.348  12.692  0.00  0.00      B
    ATOM   1812  CE1 TYR B 240     -13.248 -43.404  12.343  0.00  0.00      B
    ATOM   1813  HE1 TYR B 240     -13.093 -43.818  11.358  0.00  0.00      B
    ATOM   1814  N   TYR B 240      -9.050 -41.325  16.074  0.00  0.00      B
    ATOM   1815  HN  TYR B 240      -8.672 -40.404  16.021  0.00  0.00      B
    ATOM   1816  CA  TYR B 240     -10.166 -41.491  15.204  0.00  0.00      B
    ATOM   1817  HA  TYR B 240      -9.685 -41.605  14.243  0.00  0.00      B
    ATOM   1818  CB  TYR B 240     -10.940 -42.818  15.365  0.00  0.00      B
    ATOM   3807  HB1 TYR C 240     -10.241 -43.631  15.078  0.00  0.00      C
    ATOM   3808  HB2 TYR C 240     -11.241 -43.061  16.407  0.00  0.00      C
    ATOM   3809  CG  TYR C 240     -12.233 -42.972  14.454  0.00  0.00      C
    ATOM   3810  CD1 TYR C 240     -12.102 -43.272  13.086  0.00  0.00      C
    ATOM   3811  HD1 TYR C 240     -11.100 -43.348  12.692  0.00  0.00      C
    ATOM   3812  CE1 TYR C 240     -13.248 -43.404  12.343  0.00  0.00      C
    ATOM   3813  HE1 TYR C 240     -13.093 -43.818  11.358  0.00  0.00      C
    

    Running the code described above gives the following result (the listed values come from column 1, i.e. the atom serial numbers):

    {'A': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17'], 'C': ['3807', '3808', '3809', '3810', '3811', '3812', '3813'], 'B': ['1800', '1802', '1803', '1804', '1805', '1806', '1807', '1808', '1810', '1811', '1812', '1813', '1814', '1815', '1816', '1817', '1818']}