Splitting a large file containing multiple molecules


I have a file that contains 10,000 molecules. Each molecule ends with the keyword $$$$. I want to split the main file into 10,000 separate files so that each file contains only one molecule. Each molecule has a different number of lines. I tried sed on test_file.txt:

sed '/$$$$/q' test_file.txt > out.txt

input:

$ cat test_file.txt
ashu
vishu
jyoti
$$$$
Jatin
Vishal
Shivani
$$$$  

output:

$ cat out.txt
ashu
vishu
jyoti
$$$$

I could loop this over the whole main file to create 10,000 separate files, but how do I delete from the main file the molecule that was just moved to a new file? Or please suggest a better method, which I believe exists. Thanks.

Edit1:

$ cat short_library.sdf

untitled.cdx
csChFnd80/09142214492D

 31 34  0  0  0  0  0  0  0  0999 V2000
    8.4660    6.2927    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.4660    4.8927    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2124    2.0951    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4249    2.7951    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
 30 31  1  0  0  0  0
 31 26  1  0  0  0  0
M  END
>  <Mol_ID> (1)
1

>  <Formula> (1)
C22H24ClFN4O3

>  <URL> (1)
http://www.selleckchem.com/products/Gefitinib.html

$$$$
Dimesna.cdx
csChFnd80/09142214492D

 16 13  0  0  0  0  0  0  0  0999 V2000
    2.4249    1.4000    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    3.6415    2.1024    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.8540    1.4024    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4904    1.7512    0.0000 Na  0  3  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  1 14  2  0  0  0  0
M  END
>  <Mol_ID> (2)
2

>  <Formula> (2)
C4H8Na2O6S4


>  <URL> (2)
http://www.selleckchem.com/products/Dimesna.html

$$$$

Solution

  • Here's a simple solution with standard awk:

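    LANG=C awk '
        # Append each line to the current record. The line counter n keeps a
        # blank first line (an empty SDF title line) from being dropped.
        { mol = (n++ ? mol "\n" : "") $0 }
        # A line that is exactly $$$$ (optionally followed by CR) ends a record.
        /^\$\$\$\$\r?$/ {
            outFile = "molecule" (++fn) ".sdf"
            print mol > outFile
            close(outFile)    # do not keep 10,000 files open at once
            mol = ""
            n = 0
        }
    ' input.sdf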
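
To check that the split produced one file per record, it can help to compare the number of `$$$$` terminator lines in the input with the number of output files. The following is only a sketch and not part of the original answer. It assumes a scratch directory created with `mktemp -d`, rebuilds the two-record `test_file.txt` from the question, and uses variable names (`workdir`, `expected`, `actual`, and the awk line counter `n`) chosen here for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so the generated files stay out of the way.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two-record sample from the question.
printf '%s\n' ashu vishu jyoti '$$$$' Jatin Vishal Shivani '$$$$' > test_file.txt

# Buffer each record and write it out when the $$$$ terminator is seen.
# n counts lines in the current record, so a blank first line would be kept.
LANG=C awk '
    { mol = (n++ ? mol "\n" : "") $0 }
    /^\$\$\$\$\r?$/ {
        outFile = "molecule" (++fn) ".sdf"
        print mol > outFile
        close(outFile)
        mol = ""
        n = 0
    }
' test_file.txt

# Number of terminator lines in the input.
expected=$(grep -c '^\$\$\$\$' test_file.txt)

# Number of output files actually present.
actual=0
for f in molecule*.sdf; do
    [ -e "$f" ] && actual=$((actual + 1))
done

echo "expected=$expected actual=$actual"
```

If the two numbers differ on real data, a likely cause is a terminator line that does not match the pattern exactly, for example one with trailing spaces.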