Search code examples
perlbioinformaticsbioperl

Extracting DNA sequences from FASTA file with BioPerl with non-standard header


I'm trying to extract sequences from a database using the following code:

use strict;
use Bio::SearchIO; 
use Bio::DB::Fasta;


my ($file, $id, $start, $end) = ("secondround_merged_expanded.fasta","C7136661:0-107",1,10);
my $db = Bio::DB::Fasta->new($file);
my $seq = $db->seq($id, $start, $end);
print $seq,"\n";

Where the header of the sequence I'm trying to extract is: C7136661:0-107, as in the file:

>C7047455:0-100
TATAATGCGAATATCGACATTCATTTGAACTGTTAAATCGGTAACATAAGCAGCACACCTGGGCAGATAGTAAAGGCATATGATAATAAGCTGGGGGCTA

The code works fine when I switch the header to something more standard (like test). I'm thinking that BioPerl doesn't like the non-standard heading. Any way to fix this so I don't have to recode the FASTA file?


Solution

  • By default, Bio::DB::Fasta will use all non-space characters immediately following the > on the header line to form the key for the sequence. In your case this looks like C7047455:0-100, which is the same as the built-in abbreviation for a subsequence. As documented here, instead of $db->seq($id, $start, $stop) you can use $db->seq("$id:$start-$stop"), so a call to $db->seq('C7136661:0-107') looks like you are asking for $db->seq('C7136661', 0, 107), and that key doesn't exist.

    I have no way of knowing what is in your data, but if it is adequate to use just the first part of the header up to the colon as a key then you can use the -makeid callback to modify the key. Then if you use just C7136661 to retrieve the sequence it will work.

    This code demonstrates. Note that you will probably already have a .index cache file that you must delete before you see any change in behaviour.

    use strict;
    use warnings;
    
    use Bio::DB::Fasta;
    
    my ($file, $id, $start, $end) = qw(
      secondround_merged_expanded.fasta
      C7136661
      1 10
    );
    
    my $db = Bio::DB::Fasta->new($file, -makeid => \&makeid);
    
    sub makeid {
      my ($head) = @_;
      $head =~ /^>([^:]+)/ or die qq(Invalid header "$head");
      $1;
    }
    
    my $seq = $db->seq($id, $start, $end);
    print $seq, "\n";