perl - How to replace an empty value in an array without using a variable for comparison?


I have an array whose elements come from splitting a tab-delimited line.

Initial code:

#!/usr/bin/perl -w
use strict;

Below is the relevant part of the code.

sub parser_domains {

my @params = @_;

my $interpro_line = "";
my @interpro_vector = ( );
my $idr_sub_id = $params[0];
my $idr_sub_start = $params[1]+1;
my $idr_sub_end = $params[2]+1;
my $interpro_id = "";
my $interpro_start_location = 0;
my $interpro_end_location = 0;
my $interpro_db = "";
my $interpro_length = 0;
my $interpro_db_accession = "";
my $interpro_signature = "";
my $interpro_evalue = "";
my $interpro_vector_size = 0;
my $interpro_sub_file= "";
my $idr_sub_lenght = ($idr_sub_end-$idr_sub_start)+1;

$interpro_sub_file = "$result_directory_predictor/"."$idr_sub_id/"."$idr_sub_id".".fsa.tsv";

# Open the file; if that fails, print an error and return.
unless( open(TSV_FILE_DATA, $interpro_sub_file) ) {
        print "Cannot open file \"$interpro_sub_file\"\n\n";
        return;
}
my @interpro_file_line = <TSV_FILE_DATA>;
close TSV_FILE_DATA;

foreach $interpro_line (@interpro_file_line) {
    @interpro_vector = split('\t',$interpro_line);
    $interpro_id = $interpro_vector[0];
    $interpro_db = $interpro_vector[3];
    $interpro_db_accession = $interpro_vector[4];
    $interpro_start_location = $interpro_vector[6];
    $interpro_end_location = $interpro_vector[7];
    $interpro_signature = $interpro_vector[11];
    $interpro_length = ($interpro_end_location-$interpro_start_location) + 1;

    if ($interpro_signature eq ""){

            $interpro_signature = "NOIPR";
            printf IDP_REGION_FILE "\nFound a $interpro_db domain with no IPR: starts at $interpro_start_location and ends at $interpro_end_location\n";
            printf IDP_REGION_FILE "The size of $interpro_db domain in the sequence is $interpro_length\n";
            printf IDP_REGION_FILE "The IDR starts at $idr_sub_start and ends at $idr_sub_end\n";
            printf IDP_REGION_FILE "The size of IDR is $idr_sub_lenght\n";
            domains_regions($idr_sub_start,$idr_sub_end,$interpro_start_location,$interpro_end_location,$interpro_signature,$interpro_length,$interpro_db,$idr_sub_id,$interpro_db_accession,$idr_sub_lenght);
    }
    else{
        for $entry_line (@entry_file_line) {
            @entry_vector =  split('\t',$entry_line);
            $entry_ac = $entry_vector[0];
            $entry_type = $entry_vector[1];
            $entry_name = $entry_vector[2];
            chomp($entry_name);

            if ($interpro_signature eq $entry_ac) {
                printf IDP_REGION_FILE "\nFound a $interpro_db domain with Interpro Signature $entry_ac: starts at $interpro_start_location and ends at $interpro_end_location\n";
                printf IDP_REGION_FILE "The size of $interpro_db domain in the sequence is $interpro_length\n";
                printf IDP_REGION_FILE "The Interpro Signature $entry_ac belongs to type $entry_type\n";
                printf IDP_REGION_FILE "The name of $entry_ac is $entry_name\n";
                printf IDP_REGION_FILE "The IDR starts at $idr_sub_start and ends at $idr_sub_end\n";
                printf IDP_REGION_FILE "The size of IDR is $idr_sub_lenght\n";  

                domains_regions($idr_sub_start,$idr_sub_end,$interpro_start_location,$interpro_end_location,$interpro_signature,$interpro_length,$interpro_db,$idr_sub_id,$interpro_db_accession,$idr_sub_lenght);
            }
        }
    }
}
}

An example of the TSV file (InterProScan output):

P51587  14086411a2cdf1c4cba63020e1622579    3418    Pfam    PF09103 BRCA2, oligonucleotide/oligosaccharide-binding, domain 1    2670    2799    7.9E-43 T   15-03-2013
P51587  14086411a2cdf1c4cba63020e1622579    3418    ProSiteProfiles PS50138 BRCA2 repeat profile.   1002    1036    0.0 T   18-03-2013  IPR002093   BRCA2 repeat    GO:0005515|GO:0006302
P51587  14086411a2cdf1c4cba63020e1622579    3418    Gene3D  G3DSA:2.40.50.140       2966    3051    3.1E-52 T   15-03-2013
...

The script works perfectly, but the comparison $interpro_signature eq "" produces a warning:

Use of uninitialized value $interpro_signature in string eq at /home/duca/eclipse-workspace/idps/idp_parser_interpro.pl line 666.

So I searched for, and tried, ways to replace the empty value in the array before the comparison. I would like the empty value to be replaced by "NOIPR". I'm working with 9 complete genomes, and I have more than 324000 proteins to parse.

How can I replace the empty values in my array?

Thanks.


Solution

  • Your array may not have 12 elements (or the 12th element may be undef), so supply a default:

    my $interpro_signature = $interpro_vector[11] // 'some_default_value'; 
    

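    The // is the defined-or operator (available since Perl 5.10). It returns its left operand unless that is undef, in which case it returns the right operand.

    The warning Use of uninitialized value means that the variable hasn't been initialized, or that it's been set to undef.

    See perldiag, and consult it regularly. When you hit errors or warnings, run the code with perl -Mdiagnostics ... to get the longer explanations.

    The use warnings; pragma is actually better than -w: it is lexically scoped, while -w turns warnings on globally, including inside modules you load.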
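
As a minimal sketch of the defined-or fix, assuming a tab-separated line that stops before the signature column (the sample line and the names $line and @fields are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical line with only 11 tab-separated fields, so index 11 is missing.
my $line = join("\t", 'P51587', 'md5', 3418, 'Pfam', 'PF09103', 'desc',
                2670, 2799, '7.9E-43', 'T', '15-03-2013') . "\n";
chomp $line;                       # drop the newline before splitting
my @fields = split /\t/, $line;

# $fields[11] is undef here; // substitutes the default without a warning.
my $signature = $fields[11] // 'NOIPR';
print "$signature\n";
```

Because the default is applied at the point of assignment, the later comparison never sees an undefined value.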


    Update to a substantial edit of the question

    • From the data shown, it appears that other fields may also be missing from the file, so guard all variables with defaults, much like the array element at index 11 above. This is what you want to do in general anyway. For example, if a line may lack some of its trailing fields (so that the corresponding array elements are undef)

      my @interpro_defaults = ('id_default', 'db_default', ...);
      
      my ($interpro_id, $interpro_db, ...) = 
          map { 
              $interpro_vector[$_] // $interpro_defaults[$_] 
          } 0 .. $#interpro_defaults;
      

      This relies on the order of the variables in the list matching the order of the defaults, which can be error-prone; see below. Note that // only catches undef: a field that is present but empty (two tabs with nothing in between) comes out of split as an empty string and needs an explicit length test instead. If some fields in the middle of a line are simply not there, there may be (far) more work to do.
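
A rough sketch of one way to default both missing and empty fields, assuming a small helper (or_default is a hypothetical name, not something from the question or from any module):

```perl
use strict;
use warnings;

# Hypothetical helper: return $default when $value is undef or an empty string.
sub or_default {
    my ($value, $default) = @_;
    return (defined($value) && length($value)) ? $value : $default;
}

# Middle field is empty (two consecutive tabs); index 11 is absent entirely.
my @fields = split /\t/, "P51587\t\tPfam";

my $md5       = or_default($fields[1],  'NO_MD5');   # ''    -> default
my $signature = or_default($fields[11], 'NOIPR');    # undef -> default
print "$md5 $signature $fields[2]\n";
```

The defined check comes first, so length is never called on an undefined value and no warning is raised.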

    • There are too many separate variables, all related and named as $interpro_X (and then there are $idr_Y and $entry_Z, but fewer and perhaps manageable).

    Can you not bundle them in a container-type variable or a data structure?

    A hash %interpro seems suitable, with keys X (so, $interpro{id} etc). Then you can use them more easily and can perform some actions on the whole lot. You still have to watch for order when initializing since they are read sequentially, but it should be clearer this way. For example

    my @interpro_vars   = qw(id db db_accession ...);
    my @interpro_vector = qw(id_default db_default ...);
    my %interpro;
    @interpro{@interpro_vars} = @interpro_vector;
    # or simply
    @interpro{qw(id db ...)} = qw(id_default db_default ...);
    

    I've defined arrays with the keys and the values first and then used them, in case you later want to have those lists in arrays. If that's not the case, you can initialize the hash with the lists directly (the last line).

    Here

    my %h; 
    @h{LIST-keys} = LIST-values;
    

    is a way to assign the list LIST-values to the keys of the hash %h given in LIST-keys. They are assigned one for one, in the given order of both lists (which had better match in size). The @ sigil appears in front of the hash name because the expression is a list (of values for those keys), not a whole hash. Note that the hash must have been declared somewhere. See slices in perldata.
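
Putting the hash slice to work on one line of the file might look like the sketch below. The column indexes are the ones used in the question's code; the key names, the shortened sample line, and the 'NA' fallback are assumptions made for illustration.

```perl
use strict;
use warnings;

# Column positions taken from the question's code; key names are illustrative.
my @keys    = qw(id db db_accession start end signature);
my @columns = (0, 3, 4, 6, 7, 11);

# Shortened sample line with 11 fields, so the signature column is missing.
my $line = "P51587\t14086411a2cdf1c4cba63020e1622579\t3418\tPfam\tPF09103\t"
         . "BRCA2 domain\t2670\t2799\t7.9E-43\tT\t15-03-2013";
my @interpro_vector = split /\t/, $line;

my %interpro;
@interpro{@keys} = @interpro_vector[@columns];   # hash slice from an array slice

# Fill in anything missing or empty in one pass over the hash.
my %default = (signature => 'NOIPR');
for my $key (@keys) {
    next if defined($interpro{$key}) && length($interpro{$key});
    $interpro{$key} = $default{$key} // 'NA';    # 'NA' is an assumed fallback
}

$interpro{length} = $interpro{end} - $interpro{start} + 1;
print "$interpro{db} $interpro{signature} $interpro{length}\n";
```

Keeping the key names and column indexes in two parallel arrays puts the ordering concern in one place, and the defaults can then be applied to the whole record in a single loop.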