I want to read a file and store its contents in a hash of arrays, using the first column of each row as the key. Then I want to read the files in a directory and append each file name to the array for the matching key.
Input file ($file_info):
AANB John male
S00V Sara female
SBBA Anna female
files in the directory:
AANB.txt
SBBA.txt
S00V.txt
expected output:
AANB John male AANB.txt
S00V Sara female S00V.txt
SBBA Anna female SBBA.txt
Here's the script itself:
#!/usr/bin/perl
use strict;
use warnings;
my %all_samples=();
my $file_info = $ARGV[0];
open(FH, "<$file_info");
while(<FH>) {
chomp;
my @line = split("\t| ", $_);
push(@{$all_samples{$line[0]}}, $_);
}
my $dir = ".";
opendir(DIR, $dir);
my @files = grep(/\.txt$/,readdir(DIR));
closedir(DIR);
foreach my $file (@files) {
foreach my $k (keys %all_samples){
foreach my $element (@{ $all_samples{$k} }){
my @element = split(' ', $element);
if ($file =~ m/$element[0]/) {
push @{$all_samples{$element}}, $file;
}
else {
next;
}
}
}
}
foreach my $k (keys %all_samples) {
foreach my $element (@{ $all_samples{$k} }) {
print $element,"\n";
}
}
But the output is not what I expected:
AANB John male
SBBA.txt1
S00V Sara female
SBBA Anna female
S00V.txt1
AANB.txt1
I think that
my @element = split(' ', $element);
if ($file =~ m/$element[0]/) {
push @{$all_samples{$element}}, $file;
}
is not doing the right thing: $element holds the whole line rather than the key, so $all_samples{$element} is a new arrayref. You're printing six one-element arrays rather than three two-element arrays.
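To see the problem in isolation, here is a minimal sketch with one record and one file name hard-coded from the question's sample data (nothing is read from disk). Because $element is the whole line, the push autovivifies a second key instead of extending the existing one:

```perl
use strict;
use warnings;

# One record, stored the way the question's script stores it.
my %all_samples = ( AANB => ['AANB John male'] );
my $file = 'AANB.txt';

foreach my $k ( keys %all_samples ) {
    foreach my $element ( @{ $all_samples{$k} } ) {
        # $element is the whole line, so this creates the new key
        # 'AANB John male' instead of appending to 'AANB'.
        push @{ $all_samples{$element} }, $file;
    }
}

print scalar( keys %all_samples ), "\n";           # 2
print scalar( @{ $all_samples{AANB} } ), "\n";     # 1
```

Pushing onto $all_samples{$k} instead would keep the file name in the same array as the record it belongs to.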
But then it doesn't help that you're printing the array elements one per line.
I think that your final section should look more like this:
foreach my $k (keys %all_samples) {
print join( "\t", @{ $all_samples{$k} } ) . "\n"
}
In general, I think that you're overcomplicating this script. Here's how I would write it:
#!/usr/bin/perl
use strict;
use warnings;
my $all_samples={};
while(<>) {
chomp;
# Note that I'm using variable names here to document
# the format of the file being read. This makes for
# easier troubleshooting: if a column is missing,
# it's easier to tell that $file_base_name shouldn't be
# 'Anna' than that $line[0] should not be 'Anna'.
my ( $file_base_name, $given_name, $sex ) = split("\t", $_);
push(@{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );
}
my $dir = ".";
opendir(DIR, $dir) or die "Cannot open directory $dir: $!";
my @files = grep(/\.txt$/,readdir(DIR));
closedir(DIR);
FILE: foreach my $file (@files) {
BASE: foreach my $base (keys %{$all_samples}){
next BASE unless( $file =~ /$base/ );
push @{$all_samples->{$base}}, $file;
}
}
foreach my $k (keys %{$all_samples} ) {
print join( "\t", @{ $all_samples->{$k} } ) . "\n";
}
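One possible variation on the matching loop, offered as a sketch rather than as part of the script above: if every file is named exactly <key>.txt (an assumption), the inner loop can be replaced by a hash lookup. That also avoids the unanchored match /$base/ attaching a file to a key that merely appears somewhere inside the file name. The records and the file list are hard-coded stand-ins for the parsed input and the readdir result, XAANB.txt is a made-up name added to show a non-match, and the sort is only there to make the output order predictable.

```perl
use strict;
use warnings;

# Hard-coded stand-ins for the parsed file and the readdir() result.
my $all_samples = {
    AANB => [ 'AANB', 'John', 'male' ],
    S00V => [ 'S00V', 'Sara', 'female' ],
    SBBA => [ 'SBBA', 'Anna', 'female' ],
};
my @files = ( 'AANB.txt', 'SBBA.txt', 'S00V.txt', 'XAANB.txt' );

foreach my $file (@files) {
    # Strip the extension to recover the candidate key.
    ( my $base = $file ) =~ s/\.txt$//;

    # Skip files that have no record; XAANB.txt falls out here.
    next unless exists $all_samples->{$base};
    push @{ $all_samples->{$base} }, $file;
}

foreach my $k ( sort keys %{$all_samples} ) {
    print join( "\t", @{ $all_samples->{$k} } ), "\n";
}
```

With the sample data this prints the three records in key order, each followed by its own file name.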
I prefer hashrefs to hashes simply because I tend to deal with nested data structures, so I'm more used to seeing $all_samples->{$k} than $all_samples{$k}. More importantly, I'm using the full power of the arrayref, meaning that I don't have to re-split a line that has already been split once.
G. Cito brings up an interesting point: why did I use
push(@{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );
rather than
push(@{$all_samples->{$file_base_name} }, [ $file_base_name, $given_name, $sex ] );
There's nothing syntactically wrong with the latter, but it wasn't what I was trying to accomplish:
Let's look at what $all_samples->{$base} would look like after
push @{$all_samples->{$base}}, $file;
If the original push had been this:
push(@{$all_samples->{$file_base_name} }, [ $file_base_name, $given_name, $sex ] );
@{$all_samples->{$base}} would look like this:
(
[ $file_base_name, $given_name, $sex ],
$file
)
If instead, we use
push(@{$all_samples->{$file_base_name} }, ( $file_base_name, $given_name, $sex ) );
@{$all_samples->{$base}} looks like this after push @{$all_samples->{$base}}, $file:
(
$file_base_name,
$given_name,
$sex,
$file
)
For instance:
(
"AANB",
"John",
"male",
"AANB.txt"
)
So when we print the array:
print join( "\t", @{ $all_samples->{$k} } ) . "\n";
the output is:
AANB John male AANB.txt
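The difference between the two pushes can also be checked directly. This is a small sketch using hard-coded sample values in place of the parsed file; $flat and $nested are names introduced here for illustration only.

```perl
use strict;
use warnings;

my ( $file_base_name, $given_name, $sex ) = ( 'AANB', 'John', 'male' );
my $file = 'AANB.txt';

# Flat: push the three fields as a list, then the file name.
my $flat = {};
push @{ $flat->{$file_base_name} }, ( $file_base_name, $given_name, $sex );
push @{ $flat->{$file_base_name} }, $file;

# Nested: push one anonymous array, then the file name.
my $nested = {};
push @{ $nested->{$file_base_name} }, [ $file_base_name, $given_name, $sex ];
push @{ $nested->{$file_base_name} }, $file;

# Four tab-separated fields on one line.
print join( "\t", @{ $flat->{AANB} } ), "\n";

# 2: an arrayref plus the file name.
print scalar( @{ $nested->{AANB} } ), "\n";
```

Joining the nested version would print a stringified reference such as ARRAY(0x...) in place of the first three fields, which is why the flat list fits this output format better.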