I want to count the number of occurrences of a character in each sequence of a FASTA file that contains multiple sequences, but the method I use counts the total for the whole file:
grep -o 'G' my_sequence.fasta | wc -l
Is there some way to do it with each of the sequences using the fasta file with multiple sequences?
The FASTA file looks as below:
Here's an awk that counts the number of G in each sequence; it discards the possible header in the FASTA file and supports multi-line sequences. Also, the description lines in the FASTA file might contain more than just the sequence ID, for example >MCHU - Calmodulin …. The code only outputs the ID, i.e. >MCHU:
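awk -v char=G '
    # Description line: print the count for the previous record, then start a new one
    /^>/ {
        if (label != "") {
            print label, gsub(char, "", sequence)
        }
        sequence = ""    # also drops any lines seen before the first record
        label = $1
        next
    }
    # Sequence line: append it to the current record
    { sequence = sequence $0 }
    # End of input: print the count for the last record
    END {
        if (label != "")
            print label, gsub(char, "", sequence)
    }
' file.fasta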
>sequence1 5
>sequence2 6
>sequence3 5
>sequence4 5
Remark: Be aware that the char parameter passed to the awk command is a regex, and that C-style escape sequences in it will be unescaped. That doesn't really matter for your use case (i.e. for characters that have no special meaning in a regex or in C escaping).
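To illustrate the remark: if the character to count is special in a regex (for example . or *), it can be counted literally instead. Below is a minimal sketch, not part of the original answer, that assumes a single literal character. The function name count_literal and the environment variable needle are hypothetical names chosen for this example. The character is passed through ENVIRON rather than -v so that awk does not process escape sequences in it, and index() is used because it searches for a plain substring rather than a regex.

```shell
# count_literal CHAR FILE   (hypothetical helper name)
# Prints "<ID> <count>" for each FASTA record, treating CHAR as a
# literal character rather than as a regex.
count_literal() {
    needle=$1 awk '
        # Count literal occurrences of c in s with index(), which does
        # a plain substring search (no regex involved).
        function count(s, c,    n, pos) {
            n = 0
            if (c == "") return 0
            while ((pos = index(s, c)) > 0) {
                n++
                s = substr(s, pos + length(c))
            }
            return n
        }
        # ENVIRON values are taken verbatim: no escape processing
        BEGIN { needle = ENVIRON["needle"] }
        # Description line: print the previous record, start a new one
        /^>/ {
            if (label != "") print label, total
            label = $1
            total = 0
            next
        }
        # Sequence line: count per line; lines before the first record are skipped
        label != "" { total += count($0, needle) }
        END { if (label != "") print label, total }
    ' "$2"
}
```

For example, count_literal . file.fasta counts only literal dots, whereas char=. in the regex-based command would count every character, because . matches any character. Conversely, the regex behaviour can be useful: with the original command, char='[GC]' counts every G or C in each sequence.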