Basically, a GenBank file consists on gene entries (announced by 'gene' followed by its corresponding 'CDS' entry (only one per gene) like the two I show here below. I would like to get locus_tag vs product in a tab-delimited two column file. 'gene' and 'CDS' are always preceded and followed by spaces.
A previous question suggested a script.
The problem is that it seems that because 'product' has sometimes '/' character inside its name, its having conflicts with this script, that, as far as I can understand, is using '/' as field separator to store information in an array?
I would like to solve this, either modifying this script or building other one.
perl -nE'
BEGIN{ ($/, $") = ("CDS", "\t") }
say "@r[0,1]" if @r= m!/(?:locus_tag|product)="(.+?)"!g and @r>1
' file
gene complement(8972..9094)
/locus_tag="HAPS_0004"
/db_xref="GeneID:7278619"
CDS complement(8972..9094)
/locus_tag="HAPS_0004"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="YP_002474657.1"
/db_xref="GI:219870282"
/db_xref="GeneID:7278619"
/translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR"
gene 68..637
/locus_tag="HPNK_00040"
CDS 68..637
/locus_tag="HPNK_00040"
/codon_start=1
/transl_table=11
/product="NinG recombination protein/bacteriophage lambda
NinG family protein"
/protein_id="CRESA:HPNK_00040"
/translation="MIKPKVKKRKCKCCGGEFKSADSFRKWCSAECGVKLAKIAQEKA
RQKAIEKRNREERAKIKATRERLKSRSEWLKDAQAIFNEYIRLRDKDEPCISCRRFHQ
GQYHAGHYRTVKAMPELRFNEDNVHKQCSACNNHLSGNITEYRINLVRKIGAERVEAL
ESYHPPVKWSVEDCKEIIKTYRAKIKELK"
As your sample GenBank file was incomplete, I went online to find a sample file that could be used in an example, and I found this file.
Using this code and the Bio::GenBankParser
module, it was parsed guessing what parts of the structure you were after. In this case, "features" that contained both a locus_tag
field and a product
field.
use strict;
use warnings;
use feature 'say';
use Bio::GenBankParser;
my $file = shift;
my $parser = Bio::GenBankParser->new( file => $file );
while ( my $seq = $parser->next_seq ) {
my $feat = $seq->{'FEATURES'};
for my $f (@$feat) {
my $tag = $f->{'feature'}{'locus_tag'};
my $prod = $f->{'feature'}{'product'};
if (defined $tag and defined $prod) {
say join "\t", $tag, $prod;
}
}
}
Usage:
perl script.pl input.txt > output.txt
Output:
MG_001 DNA polymerase III, beta subunit
MG_470 CobQ/CobB/MinD/ParA nucleotide binding domain-containing protein
The output from your one-liner for the same input would be:
MG_001 DNA polymerase III, beta subunit
MG_470 CobQ/CobB/MinD/ParA nucleotide binding
domain-containing protein
Assuming of course that you add the /s
modifier to the regex to account for multiline entries (which leeduhem pointed out in the comments):
m!/(?:locus_tag|product)="(.+?)"!sg
# ^---- this