Search code examples
bioinformaticsfastabioperl

Bioperl reading fasta sequences


I found that if my fasta file ends with a single line sequence then that sequence returned by Bioperl will have one nucleotide missing. If fasta file ends with the new line then it returns complete sequence. Don't understand why? Is this a requirement for fasta files to end with an empty new line?

This the code I am using

my $obj    = $db->get_Seq_by_id($id);
my $seq    = $obj->seq; # returns 36 or 35 nucleotides depending if last new line exists 
my $length = $obj->length; # returns 36 or 35

And the fasta sequence:

gi|37423|emb|X04588.1| Human 2.5 kb mRNA for cytoskeletal tropomyosin TM30(nm) CCCTTTAAATTTCCCTTTAAATTTCCCTTTAAATTTT


Solution

  • You should check that your fasta file has an even number of lines: wc -l file.fasta.

    It is a requirement that for each line in your fasta file, there must be an end of current line character: $. If you use the vi editor, type :set list to reveal these hidden characters. Alternatively, try: cat -A file.fasta to see the line endings.

    Also, to be a true fasta file, your header line should begin with the > character.


    Perhaps it's not so much the evenness of lines, but rather if the last line in the file contains a newline ending. If this:

    cat -A fasta.file | awk 'END { print substr ($0, length, 1) }'
    

    does not return a dollar sign ($), then you may have problems using your fasta file.


    To replicate the problem, you can remove the last newline character from a 'good' (even lined) fasta file with this:

    perl -i -pe 'chomp if eof' fasta.file
    

    And you can add a newline to the end of your file with this:

    perl -i -ne 'chomp; print "$_\n"' fasta.file