
Perl count frequency of keys in hash


I have extracted the first-level keys from a multidimensional hash. They look like this:

my @string = keys %hash;

print "@string\n";

Bacteroides fragilis (strain YCH46).Agrocybe aegerita (Black poplar mushroom) (Agaricus 
aegerita).Parabacteroides distasonis (strain ATCC 8503 / DSM 20701 / CIP 104284 / JCM 5825 / NCTC 
11152).Pelodictyon phaeoclathratiforme (strain DSM 5477 / BU-1).Clostridium kluyveri (strain NBRC 
12016).Torpedo marmorata (Marbled electric ray).Aethionema grandiflorum (Persian stone-cress).Conus 
consors (Singed cone).Saguinus labiatus (Red-chested mustached tamarin).Staphylococcus haemolyticus 
(strain JCSC1435).Aeromonas salmonicida (strain A449).Acinetobacter genomosp. 13.Staphylococcus 
aureus (strain USA300 / TCH1516).Loxosceles variegata (Recluse spider). and so on...

I am trying to count how many times the same organism is repeated (I know for sure that some of them are repeated many times).

I have tried this code:

my %count;

foreach my $os (@string) {
    $count{$os}++;
}

foreach my $os (sort keys %count) {
    print $os, " ", $count{$os}, "\n";
}

But the output shows every organism appearing just once, although I know that is not the case.

Strangely, when I defined a test list manually with some organisms repeated, the code worked.

What is happening with my hash keys?

I am able to access them separately within the list so they are well defined in principle...

Any help?

Edited:

Dumper output when the organisms are values:

'ACYP_SYNJB' => {
                        '94' => 'Synechococcus sp. (strain JA-2-3B\'a(2-13)) 
(Cyanobacteria bacterium Yellowstone B-Prime).'
                      },
      'ACTM_STRPU' => {
                        '374' => 'Strongylocentrotus purpuratus (Purple sea 
urchin).'
                      },
      'A2ML1_HUMAN' => {
                         '1454' => 'Homo sapiens (Human).'
                       },
      'ACTP_SALDC' => {
                        '549' => 'Salmonella dublin (strain CT_02021853).'
                      },
      'ACBG2_XENLA' => {
                         '739' => 'Xenopus laevis (African clawed frog).'
                       },
      'ACO1_AJECA' => {
                        '476' => 'Ajellomyces capsulatus (Darling\'s disease 
fungus) (Histoplasma capsulatum).'
                      },
      'ACTM_PISOC' => {
                        '376' => 'Pisaster ochraceus (Ochre sea star) 
(Asterias ochracea).'
                      },
      '3MGH_RHOPB' => {
                        '200' => 'Rhodopseudomonas palustris (strain 
BisB18).'
                      }
    };

And when the organisms are keys:

$VAR3585 = 'Geobacter sulfurreducens (strain ATCC 51573 / DSM 12127 / PCA).';
$VAR3586 = {
         'ACPS_GEOSL' => 126,
         'ACP_GEOSL' => 77,
         'ACKA_GEOSL' => 421,
         'ACYP_GEOSL' => 91,
         'ACCA_GEOSL' => 319
       };
$VAR3587 = 'Bactrocera dorsalis (Oriental fruit fly) (Dacus dorsalis).';
$VAR3588 = {
         'ACT3_BACDO' => 376,
         'ACT5_BACDO' => 376,
         'ACT1_BACDO' => 376,
         'ACT2_BACDO' => 376
       };
$VAR3589 = 'Caenorhabditis elegans.';
$VAR3590 = {
         'ACH5_CAEEL' => 511,
         '6PGD_CAEEL' => 484,
         'ACM2_CAEEL' => 627,
         'ACADM_CAEEL' => 417,
         'ADAL_CAEEL' => 388,
         'ACON_CAEEL' => 777,
         'ACBP3_CAEEL' => 116,
         '2AB1_CAEEL' => 495,
         '3HIDH_CAEEL' => 299,
         'ACH1_CAEEL' => 498,
         '6PGL_CAEEL' => 269,
         '2A51_CAEEL' => 542,
         '2AAA_CAEEL' => 590,
         'A16L2_CAEEL' => 534,
         'ACH4_CAEEL' => 548,
         'ACC2_CAEEL' => 445,
         'ADA17_CAEEL' => 686,
         'ACR5_CAEEL' => 598,
         'ACTL1_CAEEL' => 360,
         'ADBP1_CAEEL' => 217,
         'ACH8_CAEEL' => 474,
         '5NT3_CAEEL' => 376,
         'ACT2_CAEEL' => 376,
         'AAR2_CAEEL' => 357,
         'ACH23_CAEEL' => 545,
         'ACD11_CAEEL' => 617,
         'ABF2_CAEEL' => 85,
         'ABDH3_CAEEL' => 375,
         'ABF1_CAEEL' => 85,
         'ABH51_CAEEL' => 355,
         'ACX15_CAEEL' => 659,
         'ACC1_CAEEL' => 466,
         'ABL1_CAEEL' => 1224,
         'ACC3_CAEEL' => 517,
         'ABH52_CAEEL' => 444,
         'ACT4_CAEEL' => 376,
         'ACH2_CAEEL' => 493,
         'ACBP1_CAEEL' => 86,
         '14332_CAEEL' => 248,
         'ACR7_CAEEL' => 538,
         'ACC4_CAEEL' => 408,
         'ACE1_CAEEL' => 620,
         'AATC_CAEEL' => 408,
         'ACH6_CAEEL' => 502,
         'ACH3_CAEEL' => 564,
         'ACR3_CAEEL' => 487,
         'ACMSD_CAEEL' => 401,
         'ACH7_CAEEL' => 507,
         'ACR2_CAEEL' => 575,
         'ACASE_CAEEL' => 272,
         'ACM3_CAEEL' => 611,
         'AAPK2_CAEEL' => 626,
         'ACN1_CAEEL' => 906,
         '3HAO_CAEEL' => 281,
         'ADAS_CAEEL' => 597,
         'ACT1_CAEEL' => 376,
         'A4_CAEEL' => 686,
         'ADA10_CAEEL' => 922,
         'A16L1_CAEEL' => 578,
         'ACT3_CAEEL' => 376,
         'ACP1_CAEEL' => 426,
         'ACM1_CAEEL' => 713,
         'AAPK1_CAEEL' => 589,
         'ACOC_CAEEL' => 887,
         'ACLY_CAEEL' => 1106,
         '14331_CAEEL' => 248
       };
$VAR3591 = 'Anopheles stephensi (Indo-Pakistan malaria mosquito).';
$VAR3592 = {
         'ACES_ANOST' => 664
       };
$VAR3593 = 'Bacillus thuringiensis subsp. konkukian (strain 97-27).';
$VAR3594 = {
         'ACKA_BACHK' => 397,
         'ACCD_BACHK' => 289,
         'ACPS_BACHK' => 119,
         '3MGH_BACHK' => 205,
         'ACCA_BACHK' => 324,
         'ACP_BACHK' => 77
       };

More exactly, I want to know which organisms have more than 50 protein IDs in my hash and select them, discarding the organisms with fewer proteins.


Solution

  • More exactly, I want to know which organisms have more than 50 protein IDs in my hash and select them, discarding the organisms with fewer proteins.

    I'm not fully sure that I've completely understood your question but it looks like you have the following kind of hash:

    my %hash = (
        'protein_id#1' => {
             'some-number' => 'organism-name'
        },
        'protein_id#2' => {
             'some-number' => 'same-or-other-organism-name',
        },
        ...
    );
    

    And you want to count how many protein_id#X entries there are for each distinct organism-name.

    In this case the following should work:

     my %count;   # organism-name => number of protein_ids
     # "outer" hash has protein_id as key
     while (my ($protein,$h2) = each %hash) {
         # "inner" hash has organism-name as value
         # same organism could maybe be multiple times inside the same inner hash
         # but should only be counted once per protein_id
         my %organism;
         while (my ($some_number,$o) = each %$h2) {
             $organism{$o}++
         } 
         for (keys %organism) {
              $count{$_}++;
         }
     }
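
To then select only the organisms with more than 50 protein IDs, the counts can be filtered with grep. The following is a minimal sketch, not a definitive implementation: the sample data, the names %by_organism, %kept and $min_proteins, and the small threshold of 2 (standing in for the 50 in the question) are assumptions for illustration. It also covers the second layout from the question, where the organism is the outer key. There each organism appears exactly once, because hash keys are unique, so the figure of interest is the number of keys in the inner hash rather than the number of times the outer key occurs.

```perl
use strict;
use warnings;

# Hypothetical sample in the protein-keyed layout:
# protein ID => { some number => organism name }
my %hash = (
    'ACT1_CAEEL'  => { 376  => 'Caenorhabditis elegans.' },
    'ACT2_CAEEL'  => { 376  => 'Caenorhabditis elegans.' },
    'ACH1_CAEEL'  => { 498  => 'Caenorhabditis elegans.' },
    'A2ML1_HUMAN' => { 1454 => 'Homo sapiens (Human).' },
);

# The question asks for "more than 50"; 2 is used here only because
# the sample data is tiny.
my $min_proteins = 2;

# Count each organism once per protein ID.
my %count;
for my $protein (keys %hash) {
    my %seen;
    $seen{$_} = 1 for values %{ $hash{$protein} };
    $count{$_}++ for keys %seen;
}

# Keep only the organisms whose count exceeds the threshold.
my @selected = grep { $count{$_} > $min_proteins } sort keys %count;
print "$_ $count{$_}\n" for @selected;

# Hypothetical sample in the organism-keyed layout:
# organism name => { protein ID => some number }
my %by_organism = (
    'Caenorhabditis elegans.' => {
        'ACT1_CAEEL' => 376,
        'ACT2_CAEEL' => 376,
        'ACH1_CAEEL' => 498,
    },
    'Anopheles stephensi (Indo-Pakistan malaria mosquito).' => {
        'ACES_ANOST' => 664,
    },
);

# Here the protein count is simply the number of keys in the inner hash.
my %kept;
for my $organism (keys %by_organism) {
    my $n = keys %{ $by_organism{$organism} };
    $kept{$organism} = $by_organism{$organism} if $n > $min_proteins;
}
```

With real data, setting $min_proteins to 50 gives the selection described in the question. %kept holds references to the original inner hashes, so the protein IDs of the selected organisms remain available.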