Search code examples
bashsplitchemistry

Splitting a mol2 text file for every nth occurrences of a string in bash and printing to subfiles


I am trying to separate a text file containing millions of molecules in mol2 format using bash. An example of the code below is shown. The repeating string is @<TRIPOS>MOLECULE so I would like to use this to write subfiles every nth occurrence of it. Roughly every 30,000th time. I have tried to write this script below but nothing writes to subfiles

@<TRIPOS>MOLECULE
ZINC000169688935
54    57    1     0     0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1      C1     -1.3065    0.9799    0.3206 C.3      1 ZINC000169688935   -0.1500
2      C2     -1.4308   -0.5375    0.4726 C.3      1 ZINC000169688935   -0.0500
...
@<TRIPOS>MOLECULE
ZINC000022925452
49    49    1     0     0
SMALL
USER_CHARGES
@<TRIPOS>ATOM
1      N1     -3.9528    1.8161   -2.8615 N.am     1 ZINC000022925452   -1.2400
2      S2     -2.7113    2.7835   -3.3766 S.o2     1 ZINC000022925452    2.7000
...
#!/bin/bash

FILENAME="450mw_comp.mol2"
mol_counter="0"
file_counter="0"
new_file="$FILENAME.$file_counter"

cat $FILENAME
IFS=' '
while read -r line;
do
printf "$line" >> $new_file
if "$line" == *"@<TRIPOS>MOLECULE"*
then
     mol_counter=$((mol_counter+1))

if [ $mol_counter -eq 100 ]
then
    file_counter=$((file_counter+1))
    mol_counter=0
fi

done

Solution

  • As mentioned in the comments parsing millions of lines in a bash loop is not recommended as it will be extremely slow.

    NOTE: see 2nd half of answer for code to create sample input file

    One awk idea:

    fname='450mw_comp.mol2'
    n=2                                                  # limit each output file to 2 molecules
    
    awk -v n="${n}" '
    
    /^@<TRIPOS>MOLECULE/ { molcount++                    # pattern match? increment molecule counter and ...
                           if (molcount % n == 1) {      # if counter modulo n == 0 then ...
                              close(outfile)             # close current output file and ...
                              sfx=sprintf("%05d",++c)    # create new suffix (0-padded 5-digit #) and ...
                              outfile=FILENAME "." sfx   # create new output file name
                           }
                         }
                         { print $0 > outfile }
    ' "${fname}"
    

    NOTES:

    • remove comments to declutter code
    • change bash variable n as needed (eg, n=30000)
    • adjust output file name suffix width (%05) as needed

    Verification:

    $ fname='450mw_comp.mol2'
    
    $ ls -l "${fname}"*
    -rw-rw----+ 1 username None 2794 Feb 17 09:28 450mw_comp.mol2
    -rw-rw----+ 1 username None  508 Feb 17 09:28 450mw_comp.mol2.00001
    -rw-rw----+ 1 username None  508 Feb 17 09:28 450mw_comp.mol2.00002
    -rw-rw----+ 1 username None  508 Feb 17 09:28 450mw_comp.mol2.00003
    -rw-rw----+ 1 username None  508 Feb 17 09:28 450mw_comp.mol2.00004
    -rw-rw----+ 1 username None  508 Feb 17 09:28 450mw_comp.mol2.00005
    -rw-rw----+ 1 username None  254 Feb 17 09:28 450mw_comp.mol2.00006
    
    $ cat "${fname}".00001
    @<TRIPOS>MOLECULE
    ZINC000022XXX001
    49    49    1     0     0
    SMALL
    USER_CHARGES
    @<TRIPOS>ATOM
    1      N1     -3.9528    1.8161   -2.8615 N.am     1 ZINC000022925452   -1.2400
    2      S2     -2.7113    2.7835   -3.3766 S.o2     1 ZINC000022925452    2.7000
    @<TRIPOS>MOLECULE
    ZINC000022XXX002
    49    49    1     0     0
    SMALL
    USER_CHARGES
    @<TRIPOS>ATOM
    1      N1     -3.9528    1.8161   -2.8615 N.am     1 ZINC000022925452   -1.2400
    2      S2     -2.7113    2.7835   -3.3766 S.o2     1 ZINC000022925452    2.7000
    
    $ for f in "${fname}".*
    do
        printf "\n###### ${f}\n\n"
        grep '^ZINC' "${f}"
    done
    
    ###### 450mw_comp.mol2.00001
    
    ZINC000022XXX001
    ZINC000022XXX002
    
    ###### 450mw_comp.mol2.00002
    
    ZINC000022XXX003
    ZINC000022XXX004
    
    ###### 450mw_comp.mol2.00003
    
    ZINC000022XXX005
    ZINC000022XXX006
    
    ###### 450mw_comp.mol2.00004
    
    ZINC000022XXX007
    ZINC000022XXX008
    
    ###### 450mw_comp.mol2.00005
    
    ZINC000022XXX009
    ZINC000022XXX010
    
    ###### 450mw_comp.mol2.00006
    
    ZINC000022XXX011
    

    For demonstration purposes we'll generate an input file with 11 sample molecules; we'll terminate each ^ZINC... string with XXX plus a 0-padded 3-digit counter (we won't worry about updating the ATOM entries with matching ZINC... strings):

    fname='450mw_comp.mol2'
    
    awk '
    BEGIN {
    for (i=1;i<=11;i++)
    
    printf "@<TRIPOS>MOLECULE\n\
    ZINC000022XXX%03d\n\
    49    49    1     0     0\n\
    SMALL\n\
    USER_CHARGES\n\
    @<TRIPOS>ATOM\n\
    1      N1     -3.9528    1.8161   -2.8615 N.am     1 ZINC000022925452   -1.2400\n\
    2      S2     -2.7113    2.7835   -3.3766 S.o2     1 ZINC000022925452    2.7000\n",i
    }
    ' > "${fname}"
    

    Verification:

    $ head "${fname}"
    ZINC000022XXX001
    49    49    1     0     0
    SMALL
    USER_CHARGES
    @<TRIPOS>ATOM
    1      N1     -3.9528    1.8161   -2.8615 N.am     1 ZINC000022925452   -1.2400
    2      S2     -2.7113    2.7835   -3.3766 S.o2     1 ZINC000022925452    2.7000
    @<TRIPOS>MOLECULE
    ZINC000022XXX002
    
    $ grep '^ZINC' "${fname}"
    ZINC000022XXX001
    ZINC000022XXX002
    ZINC000022XXX003
    ZINC000022XXX004
    ZINC000022XXX005
    ZINC000022XXX006
    ZINC000022XXX007
    ZINC000022XXX008
    ZINC000022XXX009
    ZINC000022XXX010
    ZINC000022XXX011