Linux script to get all possible 7-letter combinations to generate peptides in PyMOL?


I want to generate a folder containing a PDB file for every peptide of length 7 built from a specific set of amino acids. My plan is to first write a simple Linux script that generates a file with all 7-letter combinations, like this:

AAAAAAA
AAAAAAB
AAAAABA
AAAABAA
AAABAAA
AABAAAA
ABAAAAA
BAAAAAA
AAAAABB
AAAABAB
...

I think a script like this can work, but I'm not sure (this version nests five loops; seven letters would need two more levels):

for c1 in {A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
do
    for c2 in {A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
    do
        for c3 in {A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
        do
            for c4 in {A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
            do
                for c5 in {A,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
                do
                    printf "%s\n" "$c1$c2$c3$c4$c5"
                done
            done
        done
    done
done

Then I would use another simple script that, for every row of that file, builds a peptide in PyMOL with these commands:

for aa in "row1": cmd._alt(string.lower(aa))
save row1.pdb, all

I'm new to scripting on Linux. Can anyone help me, please? Thanks.


Solution

  • Here's a technique which produces the answer 'fairly fast'. Basically, it starts with a file containing a single newline, and the list of amino acid letters. It generates a sed script (using sed, of course) that successively adds an amino acid letter to the end of a line, prints it, removes it, and moves on to the next letter.
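To make the generated script concrete, here is a minimal sketch of the same pipeline using a hypothetical three-letter alphabet (A, D, E) and scratch files named mini.* (both are illustrative assumptions, not part of the original run), so the output is small enough to inspect by eye.

```shell
# Illustration only: a 3-letter alphabet instead of the 19 amino acids.
# The mini.* file names are made up for this sketch.
printf '%s\n' A D E |
sed 's%.%s/$/&/p;s/&$//%' > mini.sed

# Each generated line appends one letter, prints the result, then removes it.
cat mini.sed

echo > mini.0                          # one empty line to bootstrap
sed -n -f mini.sed mini.0 > mini.1     # every 1-letter string
sed -n -f mini.sed mini.1 > mini.2     # every 2-letter string
cat mini.2
```

With three letters, mini.sed holds three lines of the form s/$/A/p;s/A$//, and mini.2 should list the nine pairs from AA to EE.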

    peptides-A.sh

    printf "%s\n" A D E F G H I K L M N P Q R S T V W Y |
    sed 's%.%s/$/&/p;s/&$//%' > peptides.sed
    echo > peptides.0A      # Bootstrap the process
            sed -n -f peptides.sed peptides.0A > peptides.1A
            sed -n -f peptides.sed peptides.1A > peptides.2A
            sed -n -f peptides.sed peptides.2A > peptides.3A
    timecmd sed -n -f peptides.sed peptides.3A > peptides.4A
    timecmd sed -n -f peptides.sed peptides.4A > peptides.5A
    timecmd sed -n -f peptides.sed peptides.5A > peptides.6A
    timecmd sed -n -f peptides.sed peptides.6A > peptides.7A
    

    You can think of 'timecmd' as a variant of time. It prints the start time, the command, then runs it, and then prints the end time and the elapsed time (wall-clock time only).
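The real timecmd is the author's own tool and is not shown here. As a rough stand-in, a bash function along these lines would behave similarly. The function body, the use of stderr for the timing lines, and the elapsed format are assumptions inferred from the sample output, not the author's implementation; it also assumes a date that supports +%s, as GNU and BSD date do.

```shell
# Hypothetical stand-in for timecmd (bash); not the author's tool.
timecmd() {
    local start end elapsed status
    start=$(date +%s)
    # Timing lines go to stderr so that redirecting stdout to a file
    # captures only the command's own output.
    date '+%Y-%m-%d %H:%M:%S' >&2
    printf '+ %s\n' "$*" >&2
    "$@"
    status=$?
    end=$(date +%s)
    elapsed=$((end - start))
    # Elapsed wall-clock time as hours, minutes, seconds.
    printf '%s - elapsed: %02d %02d %02d\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" \
        $((elapsed / 3600)) $((elapsed % 3600 / 60)) $((elapsed % 60)) >&2
    return "$status"
}

timecmd sleep 1
```

Defining the function at the top of peptides-A.sh would let the script run on a machine that lacks the original tool; plain time is another option if the exact output format does not matter.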

    Sample output:

    $ bash peptides-A.sh
    2015-10-16 15:25:24
    + exec sed -n -f peptides.sed peptides.3A
    2015-10-16 15:25:24 - elapsed: 00 00 00
    2015-10-16 15:25:24
    + exec sed -n -f peptides.sed peptides.4A
    2015-10-16 15:25:27 - elapsed: 00 00 03
    2015-10-16 15:25:27
    + exec sed -n -f peptides.sed peptides.5A
    2015-10-16 15:26:16 - elapsed: 00 00 49
    2015-10-16 15:26:16
    + exec sed -n -f peptides.sed peptides.6A
    2015-10-16 15:42:47 - elapsed: 00 16 31
    $ ls -l peptides.?A; rm -f peptides.?A
    -rw-r--r--  1 jleffler  staff           1 Oct 16 15:25 peptides.0A
    -rw-r--r--  1 jleffler  staff          38 Oct 16 15:25 peptides.1A
    -rw-r--r--  1 jleffler  staff        1083 Oct 16 15:25 peptides.2A
    -rw-r--r--  1 jleffler  staff       27436 Oct 16 15:25 peptides.3A
    -rw-r--r--  1 jleffler  staff      651605 Oct 16 15:25 peptides.4A
    -rw-r--r--  1 jleffler  staff    14856594 Oct 16 15:25 peptides.5A
    -rw-r--r--  1 jleffler  staff   329321167 Oct 16 15:26 peptides.6A
    -rw-r--r--  1 jleffler  staff  7150973912 Oct 16 15:42 peptides.7A
    $
    

    I used the script from the question to create peptides.5B (the script was called peptides-B.sh on my disk), and checked that peptides.5A and peptides.5B were identical.
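One way to run that kind of identity check is sketched here on a hypothetical three-letter alphabet and two levels, so it finishes instantly. The check.* file names are illustrative; cmp -s prints nothing and exits with status 0 only when the two files are byte-for-byte identical.

```shell
# Version A: the sed technique (append a letter at the end of each line).
printf '%s\n' A D E | sed 's%.%s/$/&/p;s/&$//%' > check.sed
echo > check.0A
sed -n -f check.sed check.0A > check.1A
sed -n -f check.sed check.1A > check.2A

# Version B: nested loops, as in the question, reduced to two levels.
for c1 in A D E
do
    for c2 in A D E
    do
        printf '%s\n' "$c1$c2"
    done
done > check.2B

# cmp -s is silent; only its exit status reports the result.
if cmp -s check.2A check.2B
then echo "identical"
else echo "different"
fi
```

The variant that edits the start of the line emits the same strings in a different order, so a byte-for-byte comparison against the loop output would need both files sorted first.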

    Test environment: 13" MacBook Pro, 2.7 GHz Intel Core i5, 8 GiB RAM, SSD storage.


    Editing the start of the line instead of the end of the line yields approximately a 20% performance improvement.

    Code:

    printf "%s\n" A D E F G H I K L M N P Q R S T V W Y |
    sed 's%.%s/^/&/p;s/^&//%' > peptides.sed
    echo > peptides.0A      # Bootstrap the process
            sed -n -f peptides.sed peptides.0A > peptides.1A
            sed -n -f peptides.sed peptides.1A > peptides.2A
            sed -n -f peptides.sed peptides.2A > peptides.3A
    timecmd sed -n -f peptides.sed peptides.3A > peptides.4A
    timecmd sed -n -f peptides.sed peptides.4A > peptides.5A
    timecmd sed -n -f peptides.sed peptides.5A > peptides.6A
    timecmd sed -n -f peptides.sed peptides.6A > peptides.7A
    

    Timing:

    $ bash peptides-A.sh; ls -l peptides.?A; wc peptides.?A; rm -f peptides.?A
    2015-10-16 16:05:48
    + exec sed -n -f peptides.sed peptides.3A
    2015-10-16 16:05:48 - elapsed: 00 00 00
    2015-10-16 16:05:48
    + exec sed -n -f peptides.sed peptides.4A
    2015-10-16 16:05:50 - elapsed: 00 00 02
    2015-10-16 16:05:50
    + exec sed -n -f peptides.sed peptides.5A
    2015-10-16 16:06:28 - elapsed: 00 00 38
    2015-10-16 16:06:28
    + exec sed -n -f peptides.sed peptides.6A
    2015-10-16 16:18:51 - elapsed: 00 12 23
    -rw-r--r--  1 jleffler  staff           1 Oct 16 16:05 peptides.0A
    -rw-r--r--  1 jleffler  staff          38 Oct 16 16:05 peptides.1A
    -rw-r--r--  1 jleffler  staff        1083 Oct 16 16:05 peptides.2A
    -rw-r--r--  1 jleffler  staff       27436 Oct 16 16:05 peptides.3A
    -rw-r--r--  1 jleffler  staff      651605 Oct 16 16:05 peptides.4A
    -rw-r--r--  1 jleffler  staff    14856594 Oct 16 16:05 peptides.5A
    -rw-r--r--  1 jleffler  staff   329321167 Oct 16 16:06 peptides.6A
    -rw-r--r--  1 jleffler  staff  7150973912 Oct 16 16:18 peptides.7A
            1         0          1 peptides.0A
           19        19         38 peptides.1A
          361       361       1083 peptides.2A
         6859      6859      27436 peptides.3A
       130321    130321     651605 peptides.4A
      2476099   2476099   14856594 peptides.5A
     47045881  47045881  329321167 peptides.6A
    893871739 893871739 7150973912 peptides.7A
    943531280 943531279 7495831836 total
    $
    

    I tarted up the output from wc so it was 'properly columnar' (adding spaces, in other words). The original started going wonky when the numbers contained 8 digits.
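The line and byte counts in the listing follow directly from the alphabet size: with 19 letters there are 19^n strings of length n, each occupying n letters plus a newline. A small sketch reproduces those two wc columns. The function name expected_sizes is made up here, and it relies on the shell having 64-bit integer arithmetic, as bash does.

```shell
# Hypothetical helper: print "length count bytes" for each peptide length.
expected_sizes() {
    # $1 = alphabet size, $2 = maximum peptide length
    letters=$1 max=$2
    n=0 count=1
    while [ "$n" -le "$max" ]
    do
        # count strings of length n; each line is n letters plus a newline
        printf '%d %d %d\n' "$n" "$count" $((count * (n + 1)))
        n=$((n + 1))
        count=$((count * letters))
    done
}

expected_sizes 19 7
```

The last line printed should match the peptides.7A row: 893871739 lines and 7150973912 bytes, which is worth checking against free disk space before starting the final step.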