I'm trying to extract sequences from a database using the following code:
use strict;
use Bio::SearchIO;
use Bio::DB::Fasta;
my ($file, $id, $start, $end) = ("secondround_merged_expanded.fasta","C7136661:0-107",1,10);
my $db = Bio::DB::Fasta->new($file);
my $seq = $db->seq($id, $start, $end);
print $seq,"\n";
Where the header of the sequence I'm trying to extract is: C7136661:0-107
, as in the file:
>C7047455:0-100
TATAATGCGAATATCGACATTCATTTGAACTGTTAAATCGGTAACATAAGCAGCACACCTGGGCAGATAGTAAAGGCATATGATAATAAGCTGGGGGCTA
The code works fine when I switch the header to something more standard (like test
). I'm thinking that BioPerl doesn't like the non-standard heading. Any way to fix this so I don't have to recode the FASTA file?
By default, Bio::DB::Fasta
will use all non-space characters immediately following the >
on the header line to form the key for the sequence. In your case this looks like C7047455:0-100
, which is the same as the built-in abbreviation for a subsequence. As documented here, instead of $db->seq($id, $start, $stop)
you can use $db->seq("$id:$start-$stop")
, so a call to $db->seq('C7136661:0-107')
looks like you are asking for $db->seq('C7136661', 0, 107)
, and that key doesn't exist.
I have no way of knowing what is in your data, but if it is adequate to use just the first part of the header up to the colon as a key then you can use the -makeid
callback to modify the key. Then if you use just C7136661
to retrieve the sequence it will work.
This code demonstrates. Note that you will probably already have a .index
cache file that you must delete before you see any change in behaviour.
use strict;
use warnings;
use Bio::DB::Fasta;
my ($file, $id, $start, $end) = qw(
secondround_merged_expanded.fasta
C7136661
1 10
);
my $db = Bio::DB::Fasta->new($file, -makeid => \&makeid);
sub makeid {
my ($head) = @_;
$head =~ /^>([^:]+)/ or die qq(Invalid header "$head");
$1;
}
my $seq = $db->seq($id, $start, $end);
print $seq, "\n";