bash perl awk grep fasta

Count a specific character for every species in a FASTA file


I have been trying to find the number of 1s for each species in a FASTA file that looks like this:

>111
1100101010
>102
1110000001

The desired output would be:

>111
5
>102
4

I know how to count the lines that contain a 1 with:

grep -c 1 file

My problem is that I cannot find a way to count the 1s for each species separately (instead of getting a single total for the whole file).
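Strictly speaking, `grep -c` reports the number of matching lines rather than the number of matching characters, and header lines such as `>111` match as well. A minimal sketch of a true whole-file character count, assuming headers start with `>` and should be skipped (the function name `count_ones_total` is made up for illustration):

```shell
# Hypothetical helper: total number of 1s on sequence (non-header) lines.
count_ones_total() {
    # drop header lines, delete every character except 1, count what is left
    grep -v '^>' "$1" | tr -cd '1' | wc -c
}
```

For the four-line sample above, `count_ones_total file` reports 9, whereas `grep -c 1 file` prints 4. Some `wc` implementations pad the number with leading spaces.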


Solution

  • >111
    11001010101110000001
    

    can also be written as

    >111
    1100101010
    1110000001
    

    but none of the existing solutions work for the latter, where a single sequence is wrapped across several lines. This addresses that oversight:

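    perl -Mv5.10 -ne'
       if ( /^>/ ) {                 # header line: a new record starts
          say $c if defined $c;      # print the count of the previous record
          $c = 0;
          print;                     # echo the header itself
       } else {
          $c += tr/1//;              # tr returns the number of 1s on this line
       }
       END {
          say $c if defined $c;      # print the count of the last record
       }
    ' file.fasta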
    

    For both files shown above, the program outputs

    >111
    9
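The same idea (reset a counter at each header, add to it on every sequence line, print it when the next header or the end of input arrives) can also be sketched in awk, one of the tools the question is tagged with. This is an illustrative alternative rather than part of the original answer; the function name `count_ones` is made up, and the sketch relies on awk's `gsub` returning the number of replacements it made:

```shell
# Hypothetical helper: print each header followed by the number of 1s
# in its sequence, even when the sequence spans several lines.
count_ones() {
    awk '
        /^>/ {
            if (seen) print count   # flush the count of the previous record
            print                   # echo the header line unchanged
            count = 0
            seen = 1
            next                    # do not count 1s inside the header
        }
        { count += gsub(/1/, "") }  # gsub returns how many 1s it removed
        END { if (seen) print count }
    ' "$1"
}
```

Called as `count_ones file.fasta`, this should print `>111` followed by `9` for either layout of the record shown above, and the per-species counts 5 and 4 for the two-record file in the question.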