Count the number of occurrences of a char in each sequence in a fasta file with multiple sequences using bash

I want to count the number of occurrences of a char in each sequence in a fasta file with multiple sequences, but with the method I use I count the total of the char in the fasta file:

grep -o 'G' my_sequence.fasta | wc -l

Is there some way to do it with each of the sequences using the fasta file with multiple sequences?

the fasta file look as below

>sequence1
CCGTGGGTCAATCCCGTA
>sequence2
CCGTGGGGCACTCCCGTA
>sequence3
TTGTGGGTCAATCCCGTC
>sequence4
CCCGGGTGCACTCCCGTA

Solution

Here's an awk that counts the number of G in each sequence; it discards the possible header in the FASTA file and supports multi-line sequences. Also, the description lines in the FASTA file might contain more than just the sequence ID, for example >MCHU ‑ Calmodulin …. The code only outputs the ID, i.e. >MCHU.

awk -v char=G '
    /^>/ {
        if (label != "") {
            print label, gsub(char, "", sequence)
            sequence = ""
        }
        label = $1
        next 
    }
    { sequence = sequence $0 }
    END {
        if (label != "")
            print label, gsub(char, "", sequence)
    }
' file.fasta

>sequence1 5
>sequence2 6
>sequence3 5
>sequence4 5

remark: Be aware that the char parameter in argument of the awk command is a regex and that C‑style escape sequences in it will be unescaped; well, it doesn't really matter for your use‑case (i.e. for characters that don't have any special meaning in regex nor C‑escaping).