Search code examples
bashsedgrepprotein-database

i have a protein sequence file i want to count trimers in it using sed or grep


I have a protein sequence file in the following format

uniprotID\space\sequence

sequence is a string of any length but with only 20 allowed letters i.e.

ARNDCQEGHILKMFPSTWYV

Example of 1 record

Q5768D AKCCACAKCCAC

I want to create a csv file in the following format

Q5768D     

12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2

This is what I'm currently trying:

#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g'  > TEST_FILE
declare -a SSA=(`cat TEST_FILE`)
SQL=$(echo ${#SSA[@]})
  for (( X=0; X <= "$SQL"; X++ ))
      do
         Y=$(expr $X + 1)
         Z=$(expr $X + 2)
         echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
     done  | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
rm TEST_FILE # removing temporary sequence file
sort TEMPTRIMER|uniq -c > $ID.$SQL
done < $1

in this code i am storing individual record in a different file which is not good. Also the program is very slow in 12 hours only 12000 records are accessed out of .5 million records.


Solution

  • This perl script processes cca 550'000 "trimmers"/sec. (random valid test sequences 0-8000 chars long, 100k records (~400MB) produce an 2GB output csv)

    output:

    Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
    Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...
    

    code:

    #!/usr/bin/perl
    use strict;
    $|=1;
    my $c;
    
    # process each line on input
    while (readline STDIN) {
      $c++; chomp;
      # is it a valid line? has the format and a sequence to process
      if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
        print join ";",($1,length($2));
        my %trimdb;
        my $seq=$2;
        #split the sequence into chars
        my @a=split //,$seq;
        my @trimmer;
    
        # while there are unprocessed chars in the sequence...
        while (scalar @a) {
    
          # fill up the buffer with a char from the top of the sequence
          push @trimmer, shift @a;
    
          # if the buffer is full (has 3 chars), increase the trimer frequency
          if (scalar @trimmer == 3 ) {
            $trimdb{(join "",@trimmer)}++;
            # drop the first letter from buffer, for next loop
            shift @trimmer;
          }
        }
    
        # we're done with the sequence - print the sorted list of trimers
        foreach (sort keys %trimdb) {
    
          #print in a csv (;) line
          print ";$_=$trimdb{$_}";
        }
        print"\n";
      }
      else {
        #the input line was not valid.
        print STDERR "input error: $_\n";
      }
      # just a progress counter
      printf STDERR "%8i\r",$c if not $c%100;
    }
    print STDERR "\n";
    

    if you have perl installed (most linuxes do, check the path /usr/bin/perl or replace with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv