I'm somewhat new to Perl programming and I've got a hash that can be written out like this:
$hash{"snake"}{ACB2} = [70, 120];
$hash{"snake"}{SGJK} = [183, 120];
$hash{"snake"}{KDMFS} = [1213, 120];
$hash{"snake"}{VCS2} = [21, 120];
...
$hash{"bear"}{ACB2} = [12, 87];
$hash{"bear"}{GASF} = [131, 87];
$hash{"bear"}{SDVS} = [53, 87];
...
$hash{"monkey"}{ACB2} = [70, 230];
$hash{"monkey"}{GMSD} = [234, 230];
$hash{"monkey"}{GJAS} = [521, 230];
$hash{"monkey"}{ASDA} = [134, 230];
$hash{"monkey"}{ASMD} = [700, 230];
The structure of the hash is in summary:
$hash{Organism}{ProteinID} = [protein_length, total_of_proteins_in_that_organism]
I would like to sort this hash according to some conditions. First, I only want to consider organisms whose total number of proteins is higher than 100. Then, for each of those, I would like to show the name of the organism as well as its largest protein and that protein's length.
For this, I'm going for the following approach:
foreach my $org (sort keys %hash) {
    foreach my $prot (keys %{ $hash{$org} }) {
        if ($hash{$org}{$prot}[1] > 100) {
            @sortedarray = sort {$hash{$b}[0]<=>$hash{$a}[0]} keys %hash;
            print $org."\n";
            print @sortedarray[-1]."\n";
            print $hash{$org}{$sortedarray[-1]}[0]."\n";
        }
    }
}
However, this prints the name of the organism as many times as the total number of proteins; for instance, it prints "snake" 120 times. Besides, it is not sorting properly, because I guess I should make use of the variables $org and $prot in the sorting line.
Finally, the output should look like this:
snake
"Largest protein": KDMFS [1213]
monkey
"Largest protein": ASMD [700]
The following prints all of the data, sorted:
use warnings;
use strict;
use feature 'say';
use List::Util qw(max);
my %hash;
$hash{"snake"}{ACB2} = [70, 120];
$hash{"snake"}{SGJK} = [183, 120];
$hash{"snake"}{KDMFS} = [1213, 120];
$hash{"snake"}{VCS2} = [21, 120];
$hash{"bear"}{ACB2} = [12, 87];
$hash{"bear"}{GASF} = [131, 87];
$hash{"bear"}{SDVS} = [53, 87];
$hash{"monkey"}{ACB2} = [70, 230];
$hash{"monkey"}{GMSD} = [234, 230];
$hash{"monkey"}{GJAS} = [521, 230];
$hash{"monkey"}{ASDA} = [134, 230];
$hash{"monkey"}{ASMD} = [700, 230];
# Sort organisms by their largest protein length, descending
my @top_level_keys_sorted =
    sort {
        ( max map { $hash{$b}{$_}->[0] } keys %{$hash{$b}} ) <=>
        ( max map { $hash{$a}{$_}->[0] } keys %{$hash{$a}} )
    }
    keys %hash;

for my $k (@top_level_keys_sorted) {
    say $k;
    # Within each organism, list proteins by length, descending
    say "\t$_ --> @{$hash{$k}{$_}}" for
        sort { $hash{$k}{$b}->[0] <=> $hash{$k}{$a}->[0] }
        keys %{$hash{$k}};
}
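This first sorts the top-level keys (the organisms) by the largest protein length found under each one, that is, by the maximum of the first number in the arrayref values, in descending order. With that sorted list of keys on hand we then go inside each key's hashref and sort its proteins by length, also descending. That loop is what we'd tweak to limit the output as wanted (only organisms with a total above 100, only the largest protein by length, etc).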
It prints
snake
	KDMFS --> 1213 120
	SGJK --> 183 120
	ACB2 --> 70 120
	VCS2 --> 21 120
monkey
	ASMD --> 700 230
	GJAS --> 521 230
	GMSD --> 234 230
	ASDA --> 134 230
	ACB2 --> 70 230
bear
	GASF --> 131 87
	SDVS --> 53 87
	ACB2 --> 12 87
I can't tell whether the output should show all of the "organisms with a total number of proteins higher than 100" (text) or only the largest one (desired output), so I am leaving all of it. Cut it off as needed. To get only the largest one either compare max from each key in the loop or see this post (same problem).
Note that a hash itself cannot be "sorted" as it is inherently unordered. But we can print things out sorted, as above, or generate ancillary data structures which can be sorted, if needed.
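If what's wanted is the output shown in the question (only organisms with a total above 100, each with its single largest protein), one way is to build such an ancillary list of records and sort that. The following is only a sketch under assumptions: the sub name largest_proteins and its $min_total parameter are made up for illustration, and it relies on the total being repeated in every protein entry of an organism, as in the sample data.

```perl
use strict;
use warnings;
use feature 'say';

my %hash = (
    snake  => { ACB2 => [70, 120], SGJK => [183, 120], KDMFS => [1213, 120], VCS2 => [21, 120] },
    bear   => { ACB2 => [12, 87],  GASF => [131, 87],  SDVS  => [53, 87] },
    monkey => { ACB2 => [70, 230], GMSD => [234, 230], GJAS  => [521, 230],
                ASDA => [134, 230], ASMD => [700, 230] },
);

# Hypothetical helper (not from the original post): returns a list of
# [organism, protein_id, length] records, one per organism whose total
# exceeds $min_total, ordered by protein length, largest first.
sub largest_proteins {
    my ($data, $min_total) = @_;
    my @rows;
    for my $org (keys %$data) {
        my $prots = $data->{$org};
        # Assumption: the total is the same in every entry, so take any one.
        my ($any) = values %$prots;
        next unless defined $any && $any->[1] > $min_total;
        # Longest protein first; ties broken by protein ID.
        my ($top) = sort { $prots->{$b}[0] <=> $prots->{$a}[0] || $a cmp $b }
                    keys %$prots;
        push @rows, [ $org, $top, $prots->{$top}[0] ];
    }
    my @sorted = sort { $b->[2] <=> $a->[2] || $a->[0] cmp $b->[0] } @rows;
    return @sorted;
}

for my $row (largest_proteins(\%hash, 100)) {
    my ($org, $prot, $len) = @$row;
    say $org;
    say sprintf('"Largest protein": %s [%d]', $prot, $len);
}
```

With the sample data this lists snake (KDMFS, 1213) and then monkey (ASMD, 700); bear is skipped since its total is 87. The filter is applied once per organism rather than once per protein, which is what avoids the repeated printing seen in the question's loop.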