I often parse formatted text files using Python (for biology research, but I'll try and ask my question in a way you won't need biology background.) I deal with a type of file -called a pdb file- that contains 3D structure of a protein in a formatted text. This is an example:
HEADER CHROMOSOMAL PROTEIN 02-JAN-87 1UBQ
TITLE STRUCTURE OF UBIQUITIN REFINED AT 1.8 ANGSTROMS RESOLUTION
REMARK 1
REMARK 1 REFERENCE 1
REMARK 1 AUTH S.VIJAY-KUMAR,C.E.BUGG,K.D.WILKINSON,R.D.VIERSTRA,
REMARK 1 AUTH 2 P.M.HATFIELD,W.J.COOK
REMARK 1 TITL COMPARISON OF THE THREE-DIMENSIONAL STRUCTURES OF HUMAN,
REMARK 1 TITL 2 YEAST, AND OAT UBIQUITIN
REMARK 1 REF J.BIOL.CHEM. V. 262 6396 1987
REMARK 1 REFN ISSN 0021-9258
ATOM 1 N MET A 1 27.340 24.430 2.614 1.00 9.67 N
ATOM 2 CA MET A 1 26.266 25.413 2.842 1.00 10.38 C
ATOM 3 C MET A 1 26.913 26.639 3.531 1.00 9.62 C
ATOM 4 O MET A 1 27.886 26.463 4.263 1.00 9.62 O
ATOM 5 CB MET A 1 25.112 24.880 3.649 1.00 13.77 C
ATOM 6 CG MET A 1 25.353 24.860 5.134 1.00 16.29 C
ATOM 7 SD MET A 1 23.930 23.959 5.904 1.00 17.17 S
ATOM 8 CE MET A 1 24.447 23.984 7.620 1.00 16.11 C
ATOM 9 N GLN A 2 26.335 27.770 3.258 1.00 9.27 N
ATOM 10 CA GLN A 2 26.850 29.021 3.898 1.00 9.07 C
ATOM 11 C GLN A 2 26.100 29.253 5.202 1.00 8.72 C
ATOM 12 O GLN A 2 24.865 29.024 5.330 1.00 8.22 O
ATOM 13 CB GLN A 2 26.733 30.148 2.905 1.00 14.46 C
ATOM 14 CG GLN A 2 26.882 31.546 3.409 1.00 17.01 C
ATOM 15 CD GLN A 2 26.786 32.562 2.270 1.00 20.10 C
ATOM 16 OE1 GLN A 2 27.783 33.160 1.870 1.00 21.89 O
ATOM 17 NE2 GLN A 2 25.562 32.733 1.806 1.00 19.49 N
ATOM 18 N ILE A 3 26.849 29.656 6.217 1.00 5.87 N
ATOM 19 CA ILE A 3 26.235 30.058 7.497 1.00 5.07 C
ATOM 20 C ILE A 3 26.882 31.428 7.862 1.00 4.01 C
ATOM 21 O ILE A 3 27.906 31.711 7.264 1.00 4.61 O
ATOM 22 CB ILE A 3 26.344 29.050 8.645 1.00 6.55 C
ATOM 23 CG1 ILE A 3 27.810 28.748 8.999 1.00 4.72 C
ATOM 24 CG2 ILE A 3 25.491 27.771 8.287 1.00 5.58 C
ATOM 25 CD1 ILE A 3 27.967 28.087 10.417 1.00 10.83 C
TER 26 ILE A 3
HETATM 604 O HOH A 77 45.747 30.081 19.708 1.00 12.43 O
HETATM 605 O HOH A 78 19.168 31.868 17.050 1.00 12.65 O
HETATM 606 O HOH A 79 32.010 38.387 19.636 1.00 12.83 O
HETATM 607 O HOH A 80 42.084 27.361 21.953 1.00 22.27 O
END
ATOM
marks beginning of a line that contains atomic coordinates. TER
marks end of coordinates. I want to take the whole block of text that contains atomic coordinates so I use:
import re
f = open('example.pdb', 'r+')
content = f.read()
coor = re.search('ATOM.*TER', content) #take everthing between ATOM and TER
But it matches nothing. There must be a way of taking a whole block of text by using regex. I also don't understand why this regex pattern doesn't work. Help is appreciated.
This should match (but I haven't actually tested it):
coor = re.search('ATOM.*TER', content, re.DOTALL)
If you read the documentation on DOTALL
, you will understand why it wasn't working.
A still better way of writing the above is
coor = re.search(r'^ATOM.*^TER', content, re.MULTILINE | re.DOTALL)
where it is required that ATOM
and TER
come after newlines, and where raw string notation is being used, which is customary for regular expressions (though it won't make any difference in this case).
You could also avoid regular expressions altogether:
start = content.index('\nATOM')
end = content.index('\nTER', start)
coor = content[start:end]
(This will actually not include the TER
in the result, which may be better).