i have a protein sequence file i want to count trimers in it using sed or grep

I have a protein sequence file in the following format

uniprotID\space\sequence

sequence is a string of any length but with only 20 allowed letters i.e.

ARNDCQEGHILKMFPSTWYV

Example of 1 record

Q5768D AKCCACAKCCAC

I want to create a csv file in the following format

Q5768D     

12
ACA 1
AKC 2
CAC 2
CAK 1
CCA 2
KCC 2

This is what I'm currently trying:

#!/bin/sh
while read ID SEQ # uniprot along with sequences
do
echo $SEQ | tr -d '[[:space:]]' | sed 's/./& /g'  > TEST_FILE
declare -a SSA=(`cat TEST_FILE`)
SQL=$(echo ${#SSA[@]})
  for (( X=0; X <= "$SQL"; X++ ))
      do
         Y=$(expr $X + 1)
         Z=$(expr $X + 2)
         echo ${SSA[X]} ${SSA[Y]} ${SSA[Z]}
     done  | awk '{if (NF == 3) print}' | tr -d ' ' > TEMPTRIMER
rm TEST_FILE # removing temporary sequence file
sort TEMPTRIMER|uniq -c > $ID.$SQL
done < $1

in this code i am storing individual record in a different file which is not good. Also the program is very slow in 12 hours only 12000 records are accessed out of .5 million records.

Solution

This perl script processes cca 550'000 "trimmers"/sec. (random valid test sequences 0-8000 chars long, 100k records (~400MB) produce an 2GB output csv)

output:

Q1024A;421;AAF=1;AAK=1;AFC=1;AFE=2;AGP=1;AHC=1;AHE=1;AIV=1;AKN=1;AMC=1;AQD=1;AQY=1;...
Q1074F;6753;AAA=1;AAD=1;AAE=1;AAF=2;AAN=2;AAP=2;AAT=1;ACA=1;ACC=1;ACD=1;ACE=3;ACF=2;...

code:

#!/usr/bin/perl
use strict;
$|=1;
my $c;

# process each line on input
while (readline STDIN) {
  $c++; chomp;
  # is it a valid line? has the format and a sequence to process
  if (m~^(\w+)\s+([ARNDCQEGHILKMFPSTWYV]+)\r?$~ and $2) {
    print join ";",($1,length($2));
    my %trimdb;
    my $seq=$2;
    #split the sequence into chars
    my @a=split //,$seq;
    my @trimmer;

    # while there are unprocessed chars in the sequence...
    while (scalar @a) {

      # fill up the buffer with a char from the top of the sequence
      push @trimmer, shift @a;

      # if the buffer is full (has 3 chars), increase the trimer frequency
      if (scalar @trimmer == 3 ) {
        $trimdb{(join "",@trimmer)}++;
        # drop the first letter from buffer, for next loop
        shift @trimmer;
      }
    }

    # we're done with the sequence - print the sorted list of trimers
    foreach (sort keys %trimdb) {

      #print in a csv (;) line
      print ";$_=$trimdb{$_}";
    }
    print"\n";
  }
  else {
    #the input line was not valid.
    print STDERR "input error: $_\n";
  }
  # just a progress counter
  printf STDERR "%8i\r",$c if not $c%100;
}
print STDERR "\n";

if you have perl installed (most linuxes do, check the path /usr/bin/perl or replace with yours), just run: ./count_trimers.pl < your_input_file.txt > output.csv