Tags: perl, data-structures, hashtable, associative-array, perl-data-structures

Adding a new element to an array in a hash of arrays


I want to read a file and store its contents in a hash of arrays, using the first column of each row as the key. Then I want to read the files in a directory and append each file name to the end of the array for the matching key.

Input file ($file_info):

AANB    John    male
S00V    Sara    female
SBBA    Anna    female

files in the directory:

AANB.txt
SBBA.txt
S00V.txt

expected output:

AANB    John    male    AANB.txt
S00V    Sara    female  S00V.txt
SBBA    Anna    female  SBBA.txt

Here's the script itself:

#!/usr/bin/perl

use strict;
use warnings;

my %all_samples=();
my $file_info = $ARGV[0];

open(FH, "<$file_info");

while(<FH>) {
    chomp;
    my @line = split("\t| ", $_);

    push(@{$all_samples{$line[0]}}, $_);
}

my $dir = ".";
opendir(DIR, $dir);
my @files = grep(/\.txt$/,readdir(DIR));
closedir(DIR);

foreach my $file (@files) {
    foreach my $k (keys %all_samples){
        foreach my $element (@{ $all_samples{$k} }){
            my @element = split(' ', $element);
            if ($file =~ m/$element[0]/) {
                push @{$all_samples{$element}}, $file;
            }
            else {
                next;
            }
        }
    }

}

foreach my $k (keys %all_samples) {
    foreach my $element (@{ $all_samples{$k} }) {
        print $element,"\n";
    }
}

But the output is not what I expected:

AANB    John    male
SBBA.txt1
S00V    Sara    female
SBBA    Anna    female
S00V.txt1
AANB.txt1

Solution

  • I think that

            my @element = split(' ', $element);
            if ($file =~ m/$element[0]/) {
                push @{$all_samples{$element}}, $file;
            }
    

    is not doing the right thing: $element holds the whole input line rather than the key, so $all_samples{$element} is autovivified as a new arrayref under a new key. You're printing six one-element arrays rather than three two-element arrays.

    But then it doesn't help that you're printing the array elements one per line.

    I think that your final section should look more like this:

    foreach my $k (keys %all_samples) {
        print join( "\t", @{ $all_samples{$k} } ) . "\n";
    }
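
    As a minimal sketch of that fix applied to the original data layout (each array holds the whole input line as one element), the loop can push onto the key it is already iterating over instead of onto $element. The input lines and file names are hard-coded here, which is an assumption made only so the example runs without any files on disk; the anchored, quoted match is also my own tightening rather than something from the question.

    ```perl
    use strict;
    use warnings;

    # Hard-coded stand-ins for the input file and the directory listing
    # (an assumption, so that the sketch is self-contained).
    my @lines = ("AANB\tJohn\tmale", "S00V\tSara\tfemale", "SBBA\tAnna\tfemale");
    my @files = ("AANB.txt", "SBBA.txt", "S00V.txt");

    my %all_samples;
    foreach my $line (@lines) {
        my ($key) = split /\t/, $line;          # first column is the key
        push @{ $all_samples{$key} }, $line;    # whole line stored as one element
    }

    foreach my $file (@files) {
        foreach my $k (keys %all_samples) {
            # Push onto the existing key $k, not onto a new key named after
            # the whole line. \Q...\E makes the key match literally.
            push @{ $all_samples{$k} }, $file if $file =~ /^\Q$k\E\./;
        }
    }

    # One tab-joined line per key; keys are sorted only to get a stable order.
    my @output = map { join("\t", @{ $all_samples{$_} }) } sort keys %all_samples;
    print "$_\n" for @output;
    ```

    With this change the hash keeps its three original keys, and each array ends up with two elements: the original line and the matching file name.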
    

    In general, I think that you're overcomplicating this script. Here's how I would write it:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    my $all_samples={};
    
    while(<>) {
        chomp;
        # Note that I'm using variable names here to document
        # the format of the file being read. This makes for
        # easier troubleshooting: if a column is missing,
        # it's easier to tell that $file_base_name shouldn't be
        # 'Anna' than that $line[0] should not be 'Anna'.
        my ( $file_base_name, $given_name, $sex ) = split("\t", $_);
        push(@{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );
    }
    
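    my $dir = ".";
    # Lexical directory handle, and fail loudly if the directory can't be read.
    opendir(my $dh, $dir) or die "Cannot open directory '$dir': $!";
    my @files = grep(/\.txt$/, readdir($dh));
    closedir($dh);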
    
    FILE: foreach my $file (@files) {
        BASE: foreach my $base (keys %{$all_samples}){
            next BASE unless( $file =~ /$base/ );
            push @{$all_samples->{$base}}, $file;
        }
    }
    
    foreach my $k (keys %{$all_samples} ) {
        print join( "\t", @{ $all_samples->{$k} } ) . "\n";
    }
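
    One possible variation on the matching step, which is not part of the script above: if each key is expected to equal the file name minus its .txt extension, the inner loop can be replaced by a direct hash lookup. That avoids interpreting the key as a regex and avoids partial matches (for example, a hypothetical key AAN would also match AANB.txt with /$base/). The data below is hard-coded as an assumption, and ZZZZ.txt is a made-up file with no matching record.

    ```perl
    use strict;
    use warnings;

    # Hard-coded sample records, in the same flat layout the script builds.
    my $all_samples = {
        AANB => [ 'AANB', 'John', 'male' ],
        S00V => [ 'S00V', 'Sara', 'female' ],
    };
    my @files = ( 'AANB.txt', 'S00V.txt', 'ZZZZ.txt' );   # ZZZZ has no record

    foreach my $file (@files) {
        ( my $base = $file ) =~ s/\.txt$//;          # "AANB.txt" -> "AANB"
        # exists() checks the key without autovivifying a new entry.
        next unless exists $all_samples->{$base};
        push @{ $all_samples->{$base} }, $file;
    }

    foreach my $k ( sort keys %{$all_samples} ) {
        print join( "\t", @{ $all_samples->{$k} } ), "\n";
    }
    ```

    The lookup is a single hash access per file instead of a scan over every key, which may matter if the directory or the sample list is large.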
    

    I prefer hashrefs to hashes, simply because I tend to deal with nested data structures, so I'm more used to seeing $all_samples->{$k} than $all_samples{$k}. More importantly, I'm using the full power of the arrayref: each field is stored as its own element, so I don't have to re-split a line that has already been split once.

    G. Cito brings up an interesting point: why did I use

    push(@{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );
    

    Rather than

    push(@{$all_samples->{$file_base_name} }, [ $file_base_name, $given_name, $sex ] );
    

    There's nothing syntactically wrong with the latter, but it wasn't what I was trying to accomplish:

    Let's look at what $all_samples->{$base} would look like after

    push @{$all_samples->{$base}}, $file;
    

    If the original push had been this:

    push(@{$all_samples->{$file_base_name} }, [ $file_base_name, $given_name, $sex ] );
    

    @{$all_samples->{$base}} would look like this:

    (
        [ $file_base_name, $given_name, $sex ],
        $file
    )
    

    If instead, we use

    push(@{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );
    

    @{$all_samples->{$base}} looks like this after push @{$all_samples->{$base}}, $file:

    (
        $file_base_name, 
        $given_name, 
        $sex, 
        $file
    )
    

    For instance:

    (
        "AANB",
        "John",   
        "male",    
        "AANB.txt"
    )
    

    So when we print the array with

    print join( "\t", @{ $all_samples->{$k} } ) . "\n";
    

    the output is

    AANB    John    male    AANB.txt
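
    The difference between the two push forms can be checked with a short, self-contained sketch. The hash names %flat and %nested are made up for this illustration; the values are the sample row from the question.

    ```perl
    use strict;
    use warnings;

    my ( %flat, %nested );

    # A parenthesised list is flattened: three scalars are appended.
    push @{ $flat{AANB} }, ( 'AANB', 'John', 'male' );
    push @{ $flat{AANB} }, 'AANB.txt';

    # Square brackets build one array reference: a single element is appended.
    push @{ $nested{AANB} }, [ 'AANB', 'John', 'male' ];
    push @{ $nested{AANB} }, 'AANB.txt';

    my $flat_count   = scalar @{ $flat{AANB} };      # four scalars
    my $nested_count = scalar @{ $nested{AANB} };    # one arrayref plus one scalar
    my $flat_line    = join "\t", @{ $flat{AANB} };
    my $first_type   = ref $nested{AANB}[0];         # 'ARRAY'

    print "flat:   $flat_count elements: $flat_line\n";
    print "nested: $nested_count elements, first element is of type $first_type\n";
    ```

    Joining the nested version directly would print the stringified reference (something like ARRAY(0x...)) in place of the three fields, which is why the flat form suits this output format.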