
Optimize script for huge concatenated files / loop to discard defective items from list of files


EDIT: replaced the example file with a real example; replaced the nbratoms variable by nbrbonds.

Beginner question.

I would like to optimize the following script for huge files (100 GB+). I discovered itertools yesterday but have no idea how to apply it here.

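import sys

f = open(sys.argv[1], "r")
out = open(sys.argv[2], 'w')

# Reads the whole file into memory, then splits it into one item per molecule.
lines = f.read().split('\n@<TRIPOS>MOLECULE')

for i in lines: 
    ii=i.split('\n@<TRIPOS>',4)     # header, ATOM, BOND, ALT_TYPE sections
    header=ii[0]
    infos=header.split('\n')[2]     # the counts line, e.g. "14 14 0 0 0"
    nbrbonds=infos.split(' ')[2]
    # Keep the molecule only if the bond count occurs in the BOND section.
    if str(nbrbonds) in ii[2]:
        out.write('\n@<TRIPOS>MOLECULE'+str(i))

out.close()
f.close()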

The input file consists of 200,000+ concatenated single MOL2 files (an example is at the end of the question). The script first splits the input into items at each @<TRIPOS>MOLECULE line (the first line of every MOL2 file), then splits each item at the lines starting with @<TRIPOS> into 4 parts (i.e. @<TRIPOS>MOLECULE, @<TRIPOS>ATOM, @<TRIPOS>BOND and @<TRIPOS>ALT_TYPE). For each single MOL2 file, I want to check whether the second value on the third header line (the second 14 below; it differs from file to file)

@<TRIPOS>MOLECULE
Z1198223644
14 14 0 0 0
USER_CHARGES

occurs in the 3rd part (below) of the single file:

@<TRIPOS>BOND
1       1       2 1
2       2       3 1
3       2       4 1
4       2       5 1
5       5       6 ar
6       5      11 ar
 ...

If it does, the item is written to the output file with \n@<TRIPOS>MOLECULE as its first line (essentially the way a single MOL2 file looks). The script seems to work as it is, but I fear it is far too amateurish. I also don't know how to keep the output file from starting with a doubled header marker like this:

@<TRIPOS>MOLECULE@<TRIPOS>MOLECULE
Z1198223644
...

Any help is welcome! I have attached a file containing 6 concatenated MOL2 files; the odd-numbered files are correct and the even-numbered ones are defective.

@<TRIPOS>MOLECULE
Z1198223644
14 14 0 0 0
USER_CHARGES
@<TRIPOS>ATOM
  1 F1         23.5932    2.0831  -52.2012 F      1 LIG      -0.15900
  2 C2         22.4195    1.3866  -52.4217 C.3    1 LIG       0.88300
  3 F3         22.5324    0.1265  -51.8643 F      1 LIG      -0.15900
  4 F4         21.3805    2.0570  -51.7993 F      1 LIG      -0.15900
  5 C5         22.1912    1.2555  -53.9016 C.ar   1 LIG       0.04500
  6 C6         21.0466    1.7681  -54.5284 C.ar   1 LIG      -0.13400
  7 C7         20.8964    1.6126  -55.9046 C.ar   1 LIG      -0.19400
  8 C8         21.8881    0.9505  -56.6271 C.ar   1 LIG       0.20700
  9 O9         21.7710    0.7997  -57.8724 O.2    1 LIG      -0.49500
 10 N10        22.9825    0.4691  -55.9778 N.ar   1 LIG       0.11300
 11 N11        23.1254    0.6186  -54.6592 N.ar   1 LIG      -0.68800
 12 H12        20.2773    2.2819  -53.9665 H      1 LIG       0.21400
 13 H13        20.0176    2.0033  -56.4027 H      1 LIG       0.20000
 14 H14        23.7285   -0.0277  -56.5143 H      1 LIG       0.32600
@<TRIPOS>BOND
  1       1       2 1
  2       2       3 1
  3       2       4 1
  4       2       5 1
  5       5       6 ar
  6       5      11 ar
  7       6       7 ar
  8       7       8 ar
  9       8       9 2
 10       8      10 ar
 11      10      11 ar
 12       6      12 1
 13       7      13 1
 14      10      14 1
@<TRIPOS>ALT_TYPE
CGenFF_4.0_ALT_TYPE_SET
CGenFF_4.0 1 FGA3 2 CG302 3 FGA3 4 FGA3 5 CG2R62 6 CG2R62 7 CG2R62 8 CG2R63 9 OG2D4 10 NG2R61 11 NG2R62 12 HGR62 13 HGR62 14 HGP1

Solution

  • For memory-efficient, line-by-line reading of a file (i.e. splitting on \n or the OS-specific line ending), you could iterate over the file object directly:

    with open(sys.argv[1], "r") as f, open(sys.argv[2], 'w') as out:
        for i in f:
            ...
    

    However, you need to split on '\n@<TRIPOS>MOLECULE', so I recommend writing your own generator (a function that uses yield) to produce MOLECULE chunks. It holds only one chunk in memory at a time, reading the large file lazily instead of loading everything with the greedy read():

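    def mol_reader(file_stream):
        """Yield one MOLECULE chunk at a time as a list of lines (header line excluded)."""
        chunk = []
        for line in file_stream:
            line = line.rstrip("\n")  # drop only the newline, keep the column alignment
            if line.strip() == "@<TRIPOS>MOLECULE":
                if chunk:
                    yield chunk
                chunk = []
            else:
                chunk.append(line)
        if chunk:  # avoid yielding an empty chunk for an empty file
            yield chunk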
    
    
    with open(sys.argv[1], "r") as f, open(sys.argv[2], 'w') as out:
        for mol_file in mol_reader(f):
            ...
    

    e.g. https://repl.it/@PeterAprillion/chunk-iterator
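
To show how the generator idea could be combined with the check from the question, here is a minimal sketch. The function names (mol_records, has_declared_bond, filter_mol2) are hypothetical. It assumes the third line of each record is the counts line with the bond count as its second field, as in the sample. It also swaps the substring test for a stricter variant: it looks for a line in the @<TRIPOS>BOND section whose first field equals the declared count. Whether that matches your definition of a defective file is an assumption you should verify on your data.

```python
import sys

HEADER = "@<TRIPOS>MOLECULE"
SECTION = "@<TRIPOS>"


def mol_records(stream):
    """Yield each record as a list of raw lines, header line included."""
    record = []
    for line in stream:
        # A new header closes the record collected so far.
        if line.strip() == HEADER and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def has_declared_bond(record):
    """True if a BOND line carries the bond count declared in the header."""
    if len(record) < 3 or record[0].strip() != HEADER:
        return False
    counts = record[2].split()  # assumed layout: "<atoms> <bonds> ..."
    if len(counts) < 2:
        return False
    declared = counts[1]
    in_bonds = False
    for line in record[3:]:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith(SECTION):
            # Track whether we are inside the BOND section.
            in_bonds = fields[0] == "@<TRIPOS>BOND"
        elif in_bonds and fields[0] == declared:
            return True
    return False


def filter_mol2(src, dst):
    """Copy the records that pass the check from src to dst; return how many."""
    kept = 0
    for record in mol_records(src):
        if has_declared_bond(record):
            dst.writelines(record)
            kept += 1
    return kept


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], "r") as f, open(sys.argv[2], "w") as out:
        filter_mol2(f, out)
```

Because each record keeps its own @<TRIPOS>MOLECULE line and is written back unchanged, the output cannot start with a doubled header, and only one record is held in memory at a time.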