Search code examples
linuxbashgrepfasta

Count the number of occurrences of a char in each sequence in a fasta file with multiple sequences using bash


I want to count the number of occurrences of a char in each sequence in a fasta file with multiple sequences, but with the method I use I count the total of the char in the fasta file:

grep -o 'G' my_sequence.fasta | wc -l

Is there some way to do it with each of the sequences using the fasta file with multiple sequences?

the fasta file look as below

>sequence1
CCGTGGGTCAATCCCGTA
>sequence2
CCGTGGGGCACTCCCGTA
>sequence3
TTGTGGGTCAATCCCGTC
>sequence4
CCCGGGTGCACTCCCGTA

Solution

  • Here's an awk that counts the number of G in each sequence; it discards the possible header in the FASTA file and supports multi-line sequences. Also, the description lines in the FASTA file might contain more than just the sequence ID, for example >MCHU ‑ Calmodulin …. The code only outputs the ID, i.e. >MCHU.

    awk -v char=G '
        /^>/ {
            if (label != "") {
                print label, gsub(char, "", sequence)
                sequence = ""
            }
            label = $1
            next 
        }
        { sequence = sequence $0 }
        END {
            if (label != "")
                print label, gsub(char, "", sequence)
        }
    ' file.fasta
    
    >sequence1 5
    >sequence2 6
    >sequence3 5
    >sequence4 5
    

    remark: Be aware that the char parameter in argument of the awk command is a regex and that C‑style escape sequences in it will be unescaped; well, it doesn't really matter for your use‑case (i.e. for characters that don't have any special meaning in regex nor C‑escaping).