Perl: array goes empty after passing it to a function?


I'm working on a project that has been growing a lot lately, and I'm rewriting code to make it more OOP and moving all redundant code into subroutines.

The script checks whether a gene exists in the database (through various means) or not. It may also report possible duplicates. Before reporting a duplicate, the script makes sure it's not a "biological duplicate" (essentially the same biological data but with a different position in the genome and, hence, not an actual duplicate). In order to do so...

 my @gene_ids;
 my @gene_names;                                

 while(my $gene = $geners_bychecksum->next){

        my $gene_name = $gene->gene_name;
        my $gene_id = $gene->gene_id;

        push @gene_ids, $gene_id;
        push @gene_names, $gene_name;


    }

    print STDERR "$id\tJ\tALERT CHECKSUM MULTI-HIT\t(".join(",",@gene_names).")\n"; 
    my $solve_multihit = solve_multihit($id, \@gene_names, \@gene_ids, $spc, $species_directory, $dataset);
    print STDERR "$id\tJ\tALERT CHECKSUM MULTI-HIT\t(".join(",",@gene_names).")\n"; 

    if($solve_multihit){

        print STDERR "$id\tM\tUPDATE \n";   
        print $report "$id\tM\tUPDATE \n";  
        $countM++;                                                                              

    } else {

        print STDERR "$id\tJ\tALERT CHECKSUM MULTI-HIT\t(".join(",",@gene_names).")\n"; 

    }

Here, $geners_bychecksum is a DBIC resultset with database hits from a prior search and, in this scenario, it always has more than one gene. $id, $spc, $species_directory and $dataset are all strings that come from the config and are defined above this chunk.

The solve_multihit subroutine is a rather complicated function that tries to resolve whether the multi-hits are actual duplicates or biological duplicates. Notice that I'm passing the @gene_names and @gene_ids arrays to this function by reference. The function returns the gene_id of the proper gene if it was able to resolve the discrepancy, or 0 if not. Simplified code for the sub can be found at the following link:

https://codeshare.io/2EM8qN

THE ACTUAL QUESTION

You may have noticed that the

print STDERR "$id\tJ\tALERT CHECKSUM MULTI-HIT\t(".join(",",@gene_names).")\n";

is both before and after the solve_multihit subroutine call... and the array seems to go empty after running the function, according to the STDERR:

BBOV_I005030    J   ALERT CHECKSUM MULTI-HIT    (XP_001609152.1,XP_001609157.1)
BBOV_I005030    J   ALERT CHECKSUM MULTI-HIT    ()
BBOV_I005040    J   ALERT CHECKSUM MULTI-HIT    (XP_001609156.1,XP_001609153.1)
BBOV_I005040    J   ALERT CHECKSUM MULTI-HIT    ()
BBOV_I005050    J   ALERT CHECKSUM MULTI-HIT    (XP_001609154.1,XP_001609155.1)
BBOV_I005050    J   ALERT CHECKSUM MULTI-HIT    ()
BBOV_I005060    J   ALERT CHECKSUM MULTI-HIT    (XP_001609154.1,XP_001609155.1)
BBOV_I005060    J   ALERT CHECKSUM MULTI-HIT    ()
BBOV_I005070    J   ALERT CHECKSUM MULTI-HIT    (XP_001609156.1,XP_001609153.1)
BBOV_I005070    J   ALERT CHECKSUM MULTI-HIT    ()
BBOV_I005080    J   ALERT CHECKSUM MULTI-HIT    (XP_001609152.1,XP_001609157.1)
BBOV_I005080    J   ALERT CHECKSUM MULTI-HIT    ()

Why would that happen? I'm pretty sure I could solve it by returning the arrays along with the result of the solve_multihit sub, but I wonder why the array would go empty.

PS: The J in the report is just a case-scenario key code.


Solution

  • I can see two ways for your code to accomplish the data removal that it seems to be doing.

    The function arguments available in @_ are aliases of the data passed to the function. So if you assign to an element of @_ (for example $_[0]), or modify what it refers to, you change the data outside of the function.
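
    As a minimal sketch of that aliasing (the sub name clobber and the sample data are made up for illustration, and this is not taken from the code in the question), assigning to the elements of @_ writes straight through to the caller's array:

    ```perl
    use strict;
    use warnings;

    # Hypothetical sub: every element of @_ is an alias of the caller's argument,
    # so assigning to it changes the caller's data.
    sub clobber {
        $_ = 0 for @_;
    }

    my @data = (1 .. 4);
    clobber(@data);      # array passed flat, not by reference
    print "@data\n";     # prints "0 0 0 0"
    ```

    Copying the arguments into lexicals first (my @args = @_;) breaks this link, which is why most subs start that way.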

    More likely, since you are passing by reference, your sub works directly on the caller's array through that reference:

    use feature 'say';

    sub ff {
        my ($rary) = @_;
        @$rary = ();    # empties the caller's array through the reference
    }
    
    my @data = 1..4;
    
    ff(\@data);
    
    say for @data;  # prints nothing: @data is now empty
    

    If your processing needs to change the array it works with, then make a local copy first:

    sub ff {
        my ($rary) = @_;
        my @local_ary = @$rary;    # copy the elements out of the referenced array
        # now changes to @local_ary do not affect @data in the caller
    }
    

    This is generally safer, although it introduces a data copy that is avoided when working through the reference.


    The edit to the question, together with ikegami's answer, clears this up: splice is destructive to the array it works on, and here, through some curious syntax, it is fed an array formed by dereferencing an argument in @_, so it changes the data in the caller.
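
    The exact splice call from the linked code is not reproduced here, so the following is only a sketch under assumptions: grab_with_splice and grab_with_slice are hypothetical helpers, and the names are sample values from the STDERR output above. It contrasts splice on a dereferenced argument with an array slice, which copies the elements instead of removing them:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: splice removes elements from the array it is given,
    # and here that array is the caller's, reached through the reference.
    sub grab_with_splice {
        my ($names) = @_;
        my @grabbed = splice(@$names, 0);   # offset 0, no length: removes everything
        return scalar @grabbed;
    }

    # Hypothetical helper: an array slice copies the elements and leaves the source alone.
    sub grab_with_slice {
        my ($names) = @_;
        my @grabbed = @{$names}[0 .. $#{$names}];
        return scalar @grabbed;
    }

    my @gene_names = ('XP_001609152.1', 'XP_001609157.1');

    my $sliced = grab_with_slice(\@gene_names);
    print "after slice:  (", join(',', @gene_names), ")\n";   # both names still present

    my $spliced = grab_with_splice(\@gene_names);
    print "after splice: (", join(',', @gene_names), ")\n";   # prints "after splice: ()"
    ```

    The second line of output matches the empty parentheses seen in the report after the call.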

    There is no reason for splice in what you do. Its purpose is to change the array.

    Instead, use arrayrefs that are passed to the sub

    sub solve_multihit {
        my ($id, $gene_names, $gene_ids, ...) = @_;
        foreach my $name (@$gene_names) {
            ...
        }
        ...
    }
    

    or make a local copy if you wish

    sub solve_multihit { 
        my $id = shift;
        my @gene_names = @{ shift @_ };
        ...
    }
    

    where my @gene_names is a lexical variable in this scope (the sub, in your case), so changes to it do not affect the array with the same name in the calling scope.
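
    To check that the local copy really protects the caller, here is a small self-contained sketch. The sub drain_copy is a hypothetical stand-in for solve_multihit, not its real logic; it deliberately consumes its array with shift, and the sample values come from the report in the question:

    ```perl
    use strict;
    use warnings;

    # Hypothetical stand-in for solve_multihit: it consumes its own copy of the names.
    sub drain_copy {
        my $id = shift;
        my @gene_names = @{ shift @_ };   # lexical copy, private to this sub
        my $count = 0;
        while (defined(my $name = shift @gene_names)) {   # shifts only the copy
            $count++;
        }
        return $count;
    }

    my @gene_names = ('XP_001609152.1', 'XP_001609157.1');
    my $count = drain_copy('BBOV_I005030', \@gene_names);

    print "$count names processed, ", scalar(@gene_names), " still in the caller\n";
    # prints "2 names processed, 2 still in the caller"
    ```

    Note that the copy is shallow: if the elements were themselves references, the structures they point to would still be shared with the caller.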