I'm trying to parse a GenBank (GBK) file. I need to return the locus tag and product name of every gene whose product matches a search pattern. For example, to find all predicted gene products, the search word "predicted" should return:
/product="predicted semialdehyde dehydrogenase"
/locus_tag="ECDH10B_2481"
I've been able to return the /product, but I can't figure out how to parse "backwards" to grab the /locus_tag.
Here's what I have so far:
my $fasta_file = 'example.txt';
open(INPUT, $fasta_file) || die "ERROR: can't read input FASTA file: $!";
while ( <INPUT> ) {
    if (/predicted/) {
        print $_;
    }
}
Contents of example.txt:
gene complement(2525423..2526436)
/gene="usg"
/locus_tag="ECDH10B_2481"
CDS complement(2525423..2526436)
/gene="usg"
/locus_tag="ECDH10B_2481"
/codon_start=1
/transl_table=11
/product="predicted semialdehyde dehydrogenase"
/protein_id="ACB03477.1"
/db_xref="GI:169889770"
/db_xref="ASAP:AEC-0002184"
/translation="MSEGWNIAVLGATGAVGEALLETLAERQFPVGEIYALARNESAG
EQL"
gene complement(2526502..2527638)
/gene="pdxB"
/locus_tag="ECDH10B_2482"
CDS complement(2526502..2527638)
/gene="pdxB"
/locus_tag="ECDH10B_2482"
/codon_start=1
/transl_table=11
/product="erythronate-4-phosphate dehydrogenase"
/protein_id="ACB03478.1"
/db_xref="GI:169889771"
/db_xref="ASAP:AEC-0002185"
/translation="MKILVDENMPYARDLFSRLGEVTAVPGRPIPVAQLADADALMVR
SVTKVNESLLAGKPIKFVGTATAGTDHVDEAWLKQAGIGFSAAP"
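Within each feature the /locus_tag line comes before the /product line, so there's no need to parse backwards. Just remember the last locus tag encountered and print it whenever a line matches "predicted":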
#!/usr/bin/perl
use warnings;
use strict;

my $fasta_file = 'example.txt';
open my $INPUT, '<', $fasta_file or die "ERROR: can't read input file: $!";

my $locus_tag;    # most recent /locus_tag line seen
while (<$INPUT>) {
    if (/locus_tag/) {
        $locus_tag = $_;
    } elsif (/predicted/) {
        print;               # the matching /product line
        print $locus_tag;    # the locus tag of the same feature
    }
}
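If you'd rather get the bare values back instead of the raw lines, the same "remember the last locus tag" idea can be wrapped in a small function. This is only a sketch: find_products is a hypothetical name, the sample data is an abbreviated copy of the two records from the question, and it assumes each /product value fits on one line (a value wrapped across lines would only be matched on its first line). The search word is applied to the /product value only, so a hit elsewhere in the record (for example in a /note) is ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): returns a list of
# [locus_tag, product] pairs for every /product value matching $pattern.
sub find_products {
    my ($fh, $pattern) = @_;
    my $locus_tag = '';
    my @hits;
    while (my $line = <$fh>) {
        if ($line =~ m{/locus_tag="([^"]*)"}) {
            $locus_tag = $1;    # remember the most recent locus tag
        }
        elsif (my ($product) = $line =~ m{/product="([^"]*)}) {
            # Copy the capture into $product before matching again,
            # because a second successful match would reset $1.
            push @hits, [$locus_tag, $product] if $product =~ $pattern;
        }
    }
    return @hits;
}

# Abbreviated in-memory copy of the two records from the question.
my $sample = <<'END';
     gene            complement(2525423..2526436)
                     /gene="usg"
                     /locus_tag="ECDH10B_2481"
     CDS             complement(2525423..2526436)
                     /gene="usg"
                     /locus_tag="ECDH10B_2481"
                     /product="predicted semialdehyde dehydrogenase"
     gene            complement(2526502..2527638)
                     /gene="pdxB"
                     /locus_tag="ECDH10B_2482"
     CDS             complement(2526502..2527638)
                     /gene="pdxB"
                     /locus_tag="ECDH10B_2482"
                     /product="erythronate-4-phosphate dehydrogenase"
END

open my $in, '<', \$sample or die "ERROR: can't open in-memory sample: $!";
for my $hit (find_products($in, qr/predicted/)) {
    print "$hit->[0]\t$hit->[1]\n";
}
close $in;
```

On the sample above this prints one tab-separated line with ECDH10B_2481 and "predicted semialdehyde dehydrogenase". To run it on a real file, open example.txt instead of the in-memory string and pass that filehandle in.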