First, let me start by saying that I'm doing this as a Python exercise and I'm not allowed to use Biopython.
I am writing a script that will help me parse any .pdb file generated from a trajectory. I am trying to create a dictionary that would link the chain variable with the resNumber variable. Although I solved the issue for a specific .pdb file, which only has 2 chains, I would like to make this script work for any .pdb file, no matter the number of chains. Here is what I wrote:
import sys

pdbTraj = open('md20_aligned_3frames.pdb', 'r')
pdbTraj_line = pdbTraj.readlines()
newFile = open('newfile.txt', 'w')
pdbDict = {}
resNumberList1 = []
resNumberList2 = []
chainTry = "A"
for line in pdbTraj_line:
    # startswith needs a tuple; ("ATOM" or "HETATM") evaluates to just "ATOM"
    if line.startswith(("ATOM", "HETATM")):
        atomType = line[0:6]
        atomSerialNumber = line[6:11]
        atomName = line[12:16]
        resName = line[17:20]
        chain = line[21]
        resNumber = line[22:26]
        coorX = line[30:38]
        coorY = line[38:46]
        coorZ = line[46:54]
        occupancy = line[54:60]
        temperatureFact = line[60:66]
        segmentIdentifier = line[72:76]
        elementSymbol = line[76:78]
        if chain == chainTry:
            resNumberList1.append(resNumber)
            pdbDict[chain] = list(dict.fromkeys(resNumberList1))
        else:
            resNumberList2.append(resNumber)
            pdbDict[chain] = list(dict.fromkeys(resNumberList2))
print(pdbDict)
This is the result I get:
{'A': [' 1', ' 2', ' 3', ' 4', ' 5', ' 6', ' 7', ' 8', ' 9', ' 10', ' 11', ' 12', ' 13', ' 14', ' 15', ' 16', ' 17'], 'B': [' 19', ' 20', ' 21', ' 22', ' 23', ' 24', ' 25', ' 26', ' 27', ' 28', ' 29', ' 30', ' 31', ' 32', ' 33', ' 34', ' 35', ' 36', ' 37', ' 38', ' 39', ' 40', ' 41', ' 42', ' 43', ' 44', ' 45', ' 46', ' 47', ' 48', ' 49', ' 50', ' 51', ' 52', ' 53', ' 54', ' 55', ' 56', ' 57', ' 58', ' 59', ' 60', ' 61', ' 62', ' 63', ' 64', ' 65', ' 66', ' 67', ' 68', ' 69', ' 70', ' 71', ' 72', ' 73', ' 74', ' 75', ' 76', ' 77', ' 78', ' 79', ' 80', ' 81', ' 82', ' 83', ' 84', ' 85', ' 86', ' 87', ' 88', ' 89', ' 90', ' 91', ' 92', ' 93', ' 94', ' 95', ' 96', ' 97', ' 98', ' 99', ' 100', ' 101', ' 102', ' 103', ' 104', ' 105', ' 106', ' 107', ' 108', ' 109', ' 110', ' 111', ' 112', ' 113', ' 114', ' 115', ' 116', ' 117', ' 118', ' 119', ' 120', ' 121', ' 122', ' 123', ' 124', ' 125', ' 126', ' 127', ' 128', ' 129', ' 130', ' 131', ' 132', ' 133', ' 134', ' 135', ' 136', ' 137', ' 138', ' 139', ' 140', ' 141', ' 142', ' 143', ' 144', ' 145', ' 146', ' 147', ' 148', ' 149', ' 150', ' 151', ' 152', ' 153', ' 154', ' 155', ' 156', ' 157', ' 158', ' 159', ' 160', ' 161', ' 162', ' 163', ' 164', ' 165', ' 166', ' 167', ' 168', ' 169', ' 170', ' 171', ' 172', ' 173', ' 174', ' 175', ' 176', ' 177', ' 178', ' 179', ' 180', ' 181', ' 182', ' 183', ' 184', ' 185', ' 186', ' 187', ' 188', ' 189', ' 190', ' 191', ' 192', ' 193', ' 194', ' 195', ' 196', ' 197', ' 198', ' 199', ' 200', ' 201', ' 202', ' 203', ' 204', ' 205', ' 206', ' 207', ' 208', ' 209', ' 210', ' 211', ' 212', ' 213', ' 214', ' 215', ' 216', ' 217', ' 218', ' 219', ' 220', ' 221', ' 222', ' 223', ' 224', ' 225', ' 226', ' 227', ' 228', ' 229', ' 230', ' 231', ' 232', ' 233', ' 234', ' 235', ' 236', ' 237', ' 238', ' 239', ' 240', ' 241', ' 242', ' 243', ' 244', ' 245', ' 246', ' 247', ' 248', ' 249', ' 250', ' 251', ' 252', ' 253', ' 254', ' 255', ' 256', ' 257', ' 258', ' 259', ' 260', ' 261', ' 262', ' 263', ' 264', ' 265', ' 266', ' 267', ' 268', ' 269', ' 270', ' 271', ' 272', ' 273', ' 274', ' 275', ' 276', ' 277', ' 278', ' 279', ' 280', ' 281', ' 282', ' 283', ' 284', ' 285', ' 286', ' 287', ' 288', ' 289', ' 290', ' 291', ' 292', ' 293', ' 294', ' 295', ' 296', ' 297', ' 298', ' 299', ' 300', ' 301', ' 302', ' 303', ' 304', ' 305', ' 306', ' 307', ' 308', ' 309', ' 310', ' 311', ' 312', ' 313', ' 314', ' 315', ' 316', ' 317', ' 318', ' 319', ' 320', ' 321', ' 322', ' 323', ' 324', ' 325', ' 326', ' 327', ' 328', ' 329', ' 330', ' 331', ' 332', ' 333', ' 334', ' 335', ' 336', ' 337', ' 338', ' 339', ' 340', ' 341', ' 342', ' 343', ' 344', ' 345', ' 346', ' 347', ' 348', ' 349', ' 350', ' 351', ' 352', ' 353', ' 354', ' 355', ' 356', ' 357', ' 358', ' 359', ' 360', ' 361', ' 362', ' 363', ' 364', ' 365', ' 366', ' 367', ' 368', ' 369', ' 370', ' 371']}
So, 2 keys (chain A and chain B) and 2 lists (resNumber for chain A and resNumber for chain B).
Could you help me generalize this script for any .pdb file? Thank you!
The first few lines of the .pdb file format look like this:
CRYST1 91.372 118.560 70.786 90.00 90.00 90.00 P 1 1
ATOM 1 N LYS A 1 10.246 29.908 8.932 0.00 0.00 A
ATOM 2 HT1 LYS A 1 11.053 29.331 8.619 0.00 0.00 A
ATOM 3 HT2 LYS A 1 10.405 30.386 9.842 0.00 0.00 A
ATOM 4 HT3 LYS A 1 10.211 30.643 8.197 0.00 0.00 A
ATOM 5 CA LYS A 1 9.010 29.017 8.844 0.00 0.00 A
ATOM 6 HA LYS A 1 9.395 28.160 8.311 0.00 0.00 A
ATOM 7 CB LYS A 1 8.484 28.723 10.313 0.00 0.00 A
ATOM 8 HB1 LYS A 1 9.376 28.807 10.970 0.00 0.00 A
ATOM 9 HB2 LYS A 1 7.797 29.544 10.609 0.00 0.00 A
ATOM 10 CG LYS A 1 7.855 27.321 10.494 0.00 0.00 A
ATOM 11 HG1 LYS A 1 7.016 27.501 11.199 0.00 0.00 A
ATOM 12 HG2 LYS A 1 7.294 26.942 9.613 0.00 0.00 A
ATOM 13 CD LYS A 1 8.769 26.282 10.991 0.00 0.00 A
ATOM 14 HD1 LYS A 1 9.376 26.065 10.088 0.00 0.00 A
ATOM 15 HD2 LYS A 1 9.476 26.682 11.750 0.00 0.00 A
ATOM 16 CE LYS A 1 7.894 25.110 11.592 0.00 0.00 A
ATOM 17 HE1 LYS A 1 7.347 25.505 12.475 0.00 0.00 A
Or, so you can also see chain B:
ATOM 3802 N TYR B 240 -9.050 -41.325 16.074 0.00 0.00 B
ATOM 3803 HN TYR B 240 -8.672 -40.404 16.021 0.00 0.00 B
ATOM 3804 CA TYR B 240 -10.166 -41.491 15.204 0.00 0.00 B
ATOM 3805 HA TYR B 240 -9.685 -41.605 14.243 0.00 0.00 B
ATOM 3806 CB TYR B 240 -10.940 -42.818 15.365 0.00 0.00 B
ATOM 3807 HB1 TYR B 240 -10.241 -43.631 15.078 0.00 0.00 B
ATOM 3808 HB2 TYR B 240 -11.241 -43.061 16.407 0.00 0.00 B
ATOM 3809 CG TYR B 240 -12.233 -42.972 14.454 0.00 0.00 B
ATOM 3810 CD1 TYR B 240 -12.102 -43.272 13.086 0.00 0.00 B
ATOM 3811 HD1 TYR B 240 -11.100 -43.348 12.692 0.00 0.00 B
ATOM 3812 CE1 TYR B 240 -13.248 -43.404 12.343 0.00 0.00 B
ATOM 3813 HE1 TYR B 240 -13.093 -43.818 11.358 0.00 0.00 B
If you need more information about the .pdb file format, here is a link.
My solution is the following:
# Create an empty dictionary
pdb_dict = {}
# 1. Create a list containing each line of the file as a list of fields.
#    list() is needed in Python 3, where filter() returns an iterator.
all_lines = [list(filter(lambda x: x != '', line.strip('\n').split(' ')))
             for line in open('test.pdb', 'r').readlines()]
# 2. Use a set comprehension to identify all unique chains in the file
#    (this approach assumes that there will be no two different chains with
#    the same name in the same file)
record_types = ('ATOM', 'HETATM')  # ('ATOM' or 'HETATM') would only match 'ATOM'
chains = {line[-1] for line in all_lines if line and line[0] in record_types}
# 3. Create a dictionary key for each chain and collect the values in column 1.
#    Note: line[1] is the atom serial number; the residue number is line[5].
for chain in chains:
    pdb_dict[chain] = [line[1] for line in all_lines
                       if line and line[0] in record_types and line[-1] == chain]
This approach consists of three steps:
First, you read your file into a list of lists. As the values in your file are space-separated, you can split each line you get from open('test.pdb', 'r').readlines(). Because the number of spaces separating the values varies, splitting on a single space leaves empty strings ('') in the list. The lambda function filters those out of each list (which is a line of the file). Now you can access the information in each list by its index, which corresponds to the whitespace-separated column in the pdb file (counting from 0).
Secondly, you iterate through your previously created list of lines and identify all unique chains in the file. This is what the set comprehension does.
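Finally, you iterate through your set of chains. For each chain, you create a key in the dictionary and assign it the list of values found in column 1 for that chain. In the example file, column 1 holds the atom serial number; the residue number is in column 5, so use line[5] if residue numbers are what you need.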
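The three steps can also be folded into a single pass over the lines. The following is only a sketch under the same assumption as above (whitespace-separated fields laid out as in the example file, so the chain is field 4 and the residue number is field 5). The function name chains_to_residues is made up for this illustration, and it collects residue numbers rather than atom serial numbers.

```python
def chains_to_residues(lines):
    """Map each chain ID to its residue numbers, in order of first appearance."""
    pdb_dict = {}
    for raw in lines:
        fields = raw.split()  # split() with no argument drops empty strings
        if not fields or fields[0] not in ("ATOM", "HETATM"):
            continue  # skip blank lines and records such as CRYST1
        chain, res_number = fields[4], fields[5]
        # setdefault creates the list the first time a chain is seen,
        # so any number of chains is handled without separate lists.
        residues = pdb_dict.setdefault(chain, [])
        if res_number not in residues:
            residues.append(res_number)
    return pdb_dict


sample = [
    "CRYST1 91.372 118.560 70.786 90.00 90.00 90.00 P 1 1",
    "ATOM 1 N LYS A 1 10.246 29.908 8.932 0.00 0.00 A",
    "ATOM 2 HT1 LYS A 1 11.053 29.331 8.619 0.00 0.00 A",
    "ATOM 3802 N TYR B 240 -9.050 -41.325 16.074 0.00 0.00 B",
    "ATOM 3807 HB1 TYR C 240 -10.241 -43.631 15.078 0.00 0.00 C",
]
print(chains_to_residues(sample))
# {'A': ['1'], 'B': ['240'], 'C': ['240']}
```

The membership test on the list keeps the code short but is linear in the number of residues per chain; for very large files a set kept alongside each list would be the usual alternative.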
The other values can be parsed just as easily from the all_lines list we created earlier, based on each value's index in the list, like:
for line in all_lines:
    atomType = line[0]
    atomSerialNumber = line[1]
    atomName = line[2]
    # ...
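Index-based access on split fields works for the example file, but in .pdb files adjacent fields can run together without a space, so slicing by fixed column position (as the question's code does) is the more robust route. Below is a rough sketch of that alternative. The slice boundaries are copied from the question's code, while FIELDS and parse_atom_line are names invented for this example.

```python
# (name, start, end) slices, 0-based and end-exclusive, taken from the question
FIELDS = [
    ("atomType", 0, 6),
    ("atomSerialNumber", 6, 11),
    ("atomName", 12, 16),
    ("resName", 17, 20),
    ("chain", 21, 22),
    ("resNumber", 22, 26),
    ("coorX", 30, 38),
    ("coorY", 38, 46),
    ("coorZ", 46, 54),
    ("occupancy", 54, 60),
    ("temperatureFact", 60, 66),
    ("segmentIdentifier", 72, 76),
    ("elementSymbol", 76, 78),
]


def parse_atom_line(line):
    """Slice one ATOM/HETATM record into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, start, end in FIELDS}


# Build a correctly padded sample record instead of typing the spaces by hand.
sample = (
    "ATOM".ljust(6) + "1".rjust(5) + " " + " N  " + " " + "LYS" + " " + "A"
    + "1".rjust(4) + " " * 4
    + "10.246".rjust(8) + "29.908".rjust(8) + "8.932".rjust(8)
    + "0.00".rjust(6) + "0.00".rjust(6) + " " * 6 + "A".ljust(4)
)
record = parse_atom_line(sample)
print(record["chain"], record["resNumber"], record["coorX"])  # A 1 10.246
```

All values are kept as strings here; converting resNumber with int() or the coordinates with float() is a separate step once the slices are known to be right for your files.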
I used the following example file:
CRYST1 91.372 118.560 70.786 90.00 90.00 90.00 P 1 1
ATOM 1 N LYS A 1 10.246 29.908 8.932 0.00 0.00 A
ATOM 2 HT1 LYS A 1 11.053 29.331 8.619 0.00 0.00 A
ATOM 3 HT2 LYS A 1 10.405 30.386 9.842 0.00 0.00 A
ATOM 4 HT3 LYS A 1 10.211 30.643 8.197 0.00 0.00 A
ATOM 5 CA LYS A 1 9.010 29.017 8.844 0.00 0.00 A
ATOM 6 HA LYS A 1 9.395 28.160 8.311 0.00 0.00 A
ATOM 7 CB LYS A 1 8.484 28.723 10.313 0.00 0.00 A
ATOM 8 HB1 LYS A 1 9.376 28.807 10.970 0.00 0.00 A
ATOM 9 HB2 LYS A 1 7.797 29.544 10.609 0.00 0.00 A
ATOM 10 CG LYS A 1 7.855 27.321 10.494 0.00 0.00 A
ATOM 11 HG1 LYS A 1 7.016 27.501 11.199 0.00 0.00 A
ATOM 12 HG2 LYS A 1 7.294 26.942 9.613 0.00 0.00 A
ATOM 13 CD LYS A 1 8.769 26.282 10.991 0.00 0.00 A
ATOM 14 HD1 LYS A 1 9.376 26.065 10.088 0.00 0.00 A
ATOM 15 HD2 LYS A 1 9.476 26.682 11.750 0.00 0.00 A
ATOM 16 CE LYS A 1 7.894 25.110 11.592 0.00 0.00 A
ATOM 17 HE1 LYS A 1 7.347 25.505 12.475 0.00 0.00 A
ATOM 1800 N TYR B 240 -9.050 -41.325 16.074 0.00 0.00 B
ATOM 1802 HN TYR B 240 -8.672 -40.404 16.021 0.00 0.00 B
ATOM 1803 CA TYR B 240 -10.166 -41.491 15.204 0.00 0.00 B
ATOM 1804 HA TYR B 240 -9.685 -41.605 14.243 0.00 0.00 B
ATOM 1805 CB TYR B 240 -10.940 -42.818 15.365 0.00 0.00 B
ATOM 1806 HB1 TYR B 240 -10.241 -43.631 15.078 0.00 0.00 B
ATOM 1807 HB2 TYR B 240 -11.241 -43.061 16.407 0.00 0.00 B
ATOM 1808 CG TYR B 240 -12.233 -42.972 14.454 0.00 0.00 B
ATOM 1810 CD1 TYR B 240 -12.102 -43.272 13.086 0.00 0.00 B
ATOM 1811 HD1 TYR B 240 -11.100 -43.348 12.692 0.00 0.00 B
ATOM 1812 CE1 TYR B 240 -13.248 -43.404 12.343 0.00 0.00 B
ATOM 1813 HE1 TYR B 240 -13.093 -43.818 11.358 0.00 0.00 B
ATOM 1814 N TYR B 240 -9.050 -41.325 16.074 0.00 0.00 B
ATOM 1815 HN TYR B 240 -8.672 -40.404 16.021 0.00 0.00 B
ATOM 1816 CA TYR B 240 -10.166 -41.491 15.204 0.00 0.00 B
ATOM 1817 HA TYR B 240 -9.685 -41.605 14.243 0.00 0.00 B
ATOM 1818 CB TYR B 240 -10.940 -42.818 15.365 0.00 0.00 B
ATOM 3807 HB1 TYR C 240 -10.241 -43.631 15.078 0.00 0.00 C
ATOM 3808 HB2 TYR C 240 -11.241 -43.061 16.407 0.00 0.00 C
ATOM 3809 CG TYR C 240 -12.233 -42.972 14.454 0.00 0.00 C
ATOM 3810 CD1 TYR C 240 -12.102 -43.272 13.086 0.00 0.00 C
ATOM 3811 HD1 TYR C 240 -11.100 -43.348 12.692 0.00 0.00 C
ATOM 3812 CE1 TYR C 240 -13.248 -43.404 12.343 0.00 0.00 C
ATOM 3813 HE1 TYR C 240 -13.093 -43.818 11.358 0.00 0.00 C
Running the code described above gives the following result (the values are the atom serial numbers from column 1):
pdb_dict = {'A': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17'], 'C': ['3807', '3808', '3809', '3810', '3811', '3812', '3813'], 'B': ['1800', '1802', '1803', '1804', '1805', '1806', '1807', '1808', '1810', '1811', '1812', '1813', '1814', '1815', '1816', '1817', '1818']}
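One Python detail to watch when filtering record types: an expression such as ('ATOM' or 'HETATM') is evaluated before the comparison and yields only 'ATOM', so HETATM records are silently skipped. The short sketch below shows the behaviour and one way around it; is_atom_record is a hypothetical helper name.

```python
# 'or' returns its first truthy operand, so this is just the string 'ATOM'.
record = ('ATOM' or 'HETATM')
print(record)  # ATOM


def is_atom_record(line):
    """Return True for ATOM and HETATM records."""
    # Passing a real tuple makes startswith test every prefix in it.
    return line.startswith(('ATOM', 'HETATM'))


hetatm_line = 'HETATM 1 O HOH A 1'
print(is_atom_record(hetatm_line))     # True
print(hetatm_line.startswith(record))  # False
```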