I have an array where each element comes from a line delimited by tab.
Initial code:
#!/usr/bin/perl -w
use strict;
The code below is a piece of the code.
sub parser_domains {
my @params = @_;
my $interpro_line = "";
my @interpro_vector = ( );
my $idr_sub_id = $params[0];
my $idr_sub_start = $params[1]+1;
my $idr_sub_end = $params[2]+1;
my $interpro_id = "";
my $interpro_start_location = 0;
my $interpro_end_location = 0;
my $interpro_db = "";
my $interpro_length = 0;
my $interpro_db_accession = "";
my $interpro_signature = "";
my $interpro_evalue = "";
my $interpro_vector_size = 0;
my $interpro_sub_file= "";
my $idr_sub_lenght = ($idr_sub_end-$idr_sub_start)+1;
$interpro_sub_file = "$result_directory_predictor/"."$idr_sub_id/"."$idr_sub_id".".fsa.tsv";
#open file; if it fails, print a error and exits.
unless( open(TSV_FILE_DATA, $interpro_sub_file) ) {
print "Cannot open file \"$interpro_sub_file\"\n\n";
return;
}
my @interpro_file_line = <TSV_FILE_DATA>;
close TSV_FILE_DATA;
foreach $interpro_line (@interpro_file_line) {
@interpro_vector = split('\t',$interpro_line);
$interpro_id = $interpro_vector[0];
$interpro_db = $interpro_vector[3];
$interpro_db_accession = $interpro_vector[4];
$interpro_start_location = $interpro_vector[6];
$interpro_end_location = $interpro_vector[7];
$interpro_signature = $interpro_vector[11];
$interpro_length = ($interpro_end_location-$interpro_start_location) + 1;
if ($interpro_signature eq ""){
$interpro_signature = "NOPIR";
printf IDP_REGION_FILE "\nFound a $interpro_db domain with no IPR: starts at $interpro_start_location and ends at $interpro_end_location\n";
printf IDP_REGION_FILE "The size of $interpro_db domain in the sequence is $interpro_length\n";
printf IDP_REGION_FILE "The IDR starts at $idr_sub_start and and ends at $idr_sub_end\n";
printf IDP_REGION_FILE "The size of IDR is $idr_sub_lenght\n";
domains_regions($idr_sub_start,$idr_sub_end,$interpro_start_location,$interpro_end_location,$interpro_signature,$interpro_length,$interpro_db,$idr_sub_id,$interpro_db_accession,$idr_sub_lenght);
}
else{
for $entry_line (@entry_file_line) {
@entry_vector = split('\t',$entry_line);
$entry_ac = $entry_vector[0];
$entry_type = $entry_vector[1];
$entry_name = $entry_vector[2];
chomp($entry_name);
if ($interpro_signature eq $entry_ac) {
printf IDP_REGION_FILE "\nFound a $interpro_db domain with Interpro Signature $entry_ac: starts at $interpro_start_location and ends at $interpro_end_location\n";
printf IDP_REGION_FILE "The size of $interpro_db domain in the sequence is $interpro_length\n";
printf IDP_REGION_FILE "The Interpro Signature $entry_ac belongs to type $entry_type\n";
printf IDP_REGION_FILE "The name of $entry_ac is $entry_name\n";
printf IDP_REGION_FILE "The IDR starts at $idr_sub_start and ends at $idr_sub_end\n";
printf IDP_REGION_FILE "The size of IDR is $idr_sub_lenght\n";
domains_regions($idr_sub_start,$idr_sub_end,$interpro_start_location,$interpro_end_location,$interpro_signature,$interpro_length,$interpro_db,$idr_sub_id,$interpro_db_accession,$idr_sub_lenght);
}
}
}
}
}
A example of tsv file (interproscan):
P51587 14086411a2cdf1c4cba63020e1622579 3418 Pfam PF09103 BRCA2, oligonucleotide/oligosaccharide-binding, domain 1 2670 2799 7.9E-43 T 15-03-2013
P51587 14086411a2cdf1c4cba63020e1622579 3418 ProSiteProfiles PS50138 BRCA2 repeat profile. 1002 1036 0.0 T 18-03-2013 IPR002093 BRCA2 repeat GO:0005515|GO:0006302
P51587 14086411a2cdf1c4cba63020e1622579 3418 Gene3D G3DSA:2.40.50.140 2966 3051 3.1E-52 T 15-03-2013
...
The scripts works perfectly, but the comparison $interpro_signature eq ""
provides a warning.
Use of uninitialized value $interpro_signature in string eq at /home/duca/eclipse-workspace/idps/idp_parser_interpro.pl line 666.
So, I searched and tried manners to replace the empty value into the array before the comparison. I would like the empty value by "NOIPR". I'm working with 9 completed genomes, and I have more than 324000 proteins to parse.
How can I replace the empty values in my array?
Thanks.
Your array may not have 12 elements (or the 12-th element may be undef
)
my $interpro_signature = $interpro_vector[11] // 'some_default_value';
The //
is the defined-or operator.
The error Use of uninitialized value
means that the variable hasn't been initialized, or it's been set to undef
.
See perldiag and use it regularly. Run code with perl -Mdiagnostics ...
on errors, regularly.
The use warnings;
is actually better than -w
.
Update to a substantial edit of the question
From shown data it appears that yet other fields may not be given in the file; so proof all variables with defaults, much like for the array element at index 11
above. This is what you want to do in general anyway. For example, if there are all fields in the file but some may be empty (two tabs with nothing in between)
my @interpro_defaults = ('id_default', 'db_default', ...);
my ($interpro_id, $interpro_db, ...) =
map {
$interpro_vector[$_] // $interpro_defaults[$_]
} 0 .. $#interpro_defaults;
This relies on order (of variables) in the list, what can be error prone with variables; see below. If some fields are simply not there there may be (far) more work to do.
There are too many separate variables, all related and named as $interpro_X
(and then there are $idr_Y
and $entry_Z
, but fewer and perhaps manageable).
Can you not bundle them in a container-type variable or a data structure?
A hash %interpro
seems suitable, with keys X
(so, $interpro{id}
etc). Then you can use them more easily and can perform some actions on the whole lot. You still have to watch for order when initializing since they are read sequentially, but it should be clearer this way. For example
my @interpro_vars = qw(id db db_accesssion ...);
my @interpro_vector = qw(id_default db_default ...);
my %interpro;
@interpro{@interpro_vars} = @interpro_vector;
# or simply
@interpro{qw(id db ...)} = qw(id_default db_default ...);
I've defined arrays with keys and values first and then used them, in case that you may want to later have those lists in arrays. If that's not the case you can initialize the hash with lists (the last line).
Here
my %h;
@h{LIST-keys} = LIST-values;
is a way to assign the list of LIST-values
to the set of keys of the hash %h
given in LIST-keys
. They are assigned one for one, in the given order of both lists (which had better match in size). There is the @
sigil in front of hash's keys since we are having a list (of keys) there, not a hash. Note that the hash must have been declared somewhere. See slices in perldata.