EDIT: replaced the example file with a real example; replaced the nbratoms variable by nbrbonds.
Beginner question.
I would like to optimize the following script for huge files (100 GB+). I discovered the existence of itertools yesterday but don't have a clue how to apply it.
import sys

f = open(sys.argv[1], "r")
out = open(sys.argv[2], 'w')
lines = f.read().split('\n@<TRIPOS>MOLECULE')
for i in lines:
    ii = i.split('\n@<TRIPOS>', 4)
    header = ii[0]
    infos = header.split('\n')[2]
    nbrbonds = infos.split(' ')[2]
    if str(nbrbonds) in ii[2]:
        out.write('\n@<TRIPOS>MOLECULE' + str(i))
out.close()
f.close()
The processed file is composed of 200,000+ concatenated single MOL2 files (full example at the end of the post).
The idea of the script is to first split the input file into items delimited by '\n@<TRIPOS>MOLECULE' (the first line of each new MOL2 file); then to split each item at lines starting with @<TRIPOS> into 4 parts (i.e., @<TRIPOS>MOLECULE, @<TRIPOS>ATOM, @<TRIPOS>BOND and @<TRIPOS>ALT_TYPE). For each single MOL2 file, I want to check whether the value at the position of the second 14 in the header (different in each single MOL2 file)
@<TRIPOS>MOLECULE
Z1198223644
14 14 0 0 0
USER_CHARGES
occurs in the 3rd part (below) of the single file:
@<TRIPOS>BOND
1 1 2 1
2 2 3 1
3 2 4 1
4 2 5 1
5 5 6 ar
6 5 11 ar
...
If it does -> print it to the output file, with \n@<TRIPOS>MOLECULE as its first line (essentially just the way a single MOL2 file looks).
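To illustrate, the check can be sketched on one chunk in isolation (check_molecule is just an illustrative name, not part of the real script, and here I split the counts line on arbitrary whitespace rather than single spaces):

```python
# Illustrative sketch of the per-molecule check.
# check_molecule is a hypothetical helper name for this example only.
def check_molecule(mol_text):
    parts = mol_text.split('\n@<TRIPOS>')      # MOLECULE header, ATOM, BOND, ALT_TYPE
    counts = parts[0].split('\n')[2].split()   # e.g. ['14', '14', '0', '0', '0']
    nbrbonds = counts[1]                       # second field: number of bonds
    return nbrbonds in parts[2]                # occurs anywhere in the BOND part?

mol = ("@<TRIPOS>MOLECULE\n"
       "Z1198223644\n"
       " 14 14 0 0 0\n"
       "USER_CHARGES\n"
       "@<TRIPOS>ATOM\n"
       " 1 F1 23.5932 2.0831 -52.2012 F 1 LIG -0.15900\n"
       "@<TRIPOS>BOND\n"
       " 1 1 2 1\n"
       " 14 10 14 1\n"
       "@<TRIPOS>ALT_TYPE\n"
       "CGenFF_4.0_ALT_TYPE_SET")
print(check_molecule(mol))  # True: '14' occurs in the BOND part
```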
It seems to work as it is, but I fear it's way too amateur. Additionally, I don't know how to implement a step that would prevent the output file from starting with a doubled header mark like this:
@<TRIPOS>MOLECULE@<TRIPOS>MOLECULE
Z1198223644
...
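For what it's worth, here is a minimal sketch of where the doubling comes from: the file starts with @<TRIPOS>MOLECULE with no preceding \n, so the delimiter does not match there and the marker stays inside the first element of the split:

```python
# Minimal sketch of the doubled-header symptom: the very first
# "@<TRIPOS>MOLECULE" has no "\n" before it, so split() leaves it
# inside the first chunk, and prepending the marker doubles it.
text = "@<TRIPOS>MOLECULE\nZ1\n...\n@<TRIPOS>MOLECULE\nZ2\n..."
parts = text.split('\n@<TRIPOS>MOLECULE')
print(parts[0])                            # '@<TRIPOS>MOLECULE\nZ1\n...'
print('\n@<TRIPOS>MOLECULE' + parts[0])    # doubled marker, as shown above
```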
Any help welcome! I attached a file containing 6 concatenated MOL2 files; the odd-numbered files are correct; the even-numbered ones are wrong.
@<TRIPOS>MOLECULE
Z1198223644
14 14 0 0 0
USER_CHARGES
@<TRIPOS>ATOM
1 F1 23.5932 2.0831 -52.2012 F 1 LIG -0.15900
2 C2 22.4195 1.3866 -52.4217 C.3 1 LIG 0.88300
3 F3 22.5324 0.1265 -51.8643 F 1 LIG -0.15900
4 F4 21.3805 2.0570 -51.7993 F 1 LIG -0.15900
5 C5 22.1912 1.2555 -53.9016 C.ar 1 LIG 0.04500
6 C6 21.0466 1.7681 -54.5284 C.ar 1 LIG -0.13400
7 C7 20.8964 1.6126 -55.9046 C.ar 1 LIG -0.19400
8 C8 21.8881 0.9505 -56.6271 C.ar 1 LIG 0.20700
9 O9 21.7710 0.7997 -57.8724 O.2 1 LIG -0.49500
10 N10 22.9825 0.4691 -55.9778 N.ar 1 LIG 0.11300
11 N11 23.1254 0.6186 -54.6592 N.ar 1 LIG -0.68800
12 H12 20.2773 2.2819 -53.9665 H 1 LIG 0.21400
13 H13 20.0176 2.0033 -56.4027 H 1 LIG 0.20000
14 H14 23.7285 -0.0277 -56.5143 H 1 LIG 0.32600
@<TRIPOS>BOND
1 1 2 1
2 2 3 1
3 2 4 1
4 2 5 1
5 5 6 ar
6 5 11 ar
7 6 7 ar
8 7 8 ar
9 8 9 2
10 8 10 ar
11 10 11 ar
12 6 12 1
13 7 13 1
14 10 14 1
@<TRIPOS>ALT_TYPE
CGenFF_4.0_ALT_TYPE_SET
CGenFF_4.0 1 FGA3 2 CG302 3 FGA3 4 FGA3 5 CG2R62 6 CG2R62 7 CG2R62 8 CG2R63 9 OG2D4 10 NG2R61 11 NG2R62 12 HGR62 13 HGR62 14 HGP1
For memory-efficient reading of file lines (i.e., split by \n or OS-specific line endings) you could have used:
with open(sys.argv[1], "r") as f, open(sys.argv[2], 'w') as out:
    for i in f:
        ...
but you need to split by '\n@<TRIPOS>MOLECULE', so I recommend creating your own generator (a function with yield) to produce MOLECULE chunks, loading only one chunk into memory at a time (lazily reading the large file instead of using the greedy read()):
def mol_reader(file_stream):
    chunk = []
    for line in file_stream:
        line = line.strip()
        if line == "@<TRIPOS>MOLECULE":
            if chunk:
                yield chunk
            chunk = []
        else:
            chunk.append(line)
    yield chunk
with open(sys.argv[1], "r") as f, open(sys.argv[2], 'w') as out:
    for mol_file in mol_reader(f):
        ...
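Building on that, a complete filtering pass might look like the sketch below. My assumptions: keep() is a name I made up; the generator (a slight variant of the one above, skipping an empty final chunk) drops the @<TRIPOS>MOLECULE marker line from each chunk, so writing the marker exactly once per kept molecule also avoids the doubled-header problem from the question:

```python
import sys

def mol_reader(file_stream):
    # Yields one molecule's lines at a time, dropping the
    # "@<TRIPOS>MOLECULE" marker line itself.
    chunk = []
    for line in file_stream:
        line = line.strip()
        if line == "@<TRIPOS>MOLECULE":
            if chunk:
                yield chunk
            chunk = []
        else:
            chunk.append(line)
    if chunk:
        yield chunk

def keep(chunk):
    # chunk[0] is the molecule name, chunk[1] the counts line
    # ("14 14 0 0 0"); its second field is the number of bonds.
    nbrbonds = chunk[1].split()[1]
    # Check only the lines of the BOND section, not the whole chunk.
    start = chunk.index("@<TRIPOS>BOND") + 1
    stop = chunk.index("@<TRIPOS>ALT_TYPE") if "@<TRIPOS>ALT_TYPE" in chunk else len(chunk)
    return any(nbrbonds in line for line in chunk[start:stop])

if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f, open(sys.argv[2], "w") as out:
        for chunk in mol_reader(f):
            if keep(chunk):
                # Write the marker exactly once, so no doubled header.
                out.write("@<TRIPOS>MOLECULE\n" + "\n".join(chunk) + "\n")
```

Note that line.strip() discards the original indentation of each line; if the exact input layout must survive in the output, strip only a copy for the comparison and keep the raw line in the chunk.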