
Help with walking / sorting a complex Perl data structure (HoH with AoH fun)


I've been banging my head against the wall for a couple of hours now.

I have a data structure that looks like this (output from "Data::Dumper"). It's my own fault, I'm creating the data structure as I'm parsing some input.

print Dumper $data;

___OUTPUT___
$VAR = { 'NAME' => {
                    'id' => '1234',
                    'total' => 192,
                    'results' =>  { 
                                     'APPLE'   => 48 ,
                                     'KUMQUAT' => 61 ,
                                     'ORANGE'  => 33 ,
                                  }
                   }

       }
  • There are thousands of "NAME" keys.
  • There is only ever one "id" and one "total".
  • There may be one or more key/value pairs in the "results" hash.

I want to print out a comma-separated list, sorted first by "total" and then by the value of each entry in the "results" hash.

The following code was used to print out a CSV from the already stored data structure.

use strict;
use warnings;
# [...lots of other stuff...]

open(my $fh, '>', 'out.csv') or die "Cannot open out.csv: $!";
print $fh "Name, ID, Label, Count, Total\n";

foreach ( sort { $data->{$b}->{total} <=> $data->{$a}->{total} }
    keys %{$data} )
{
    my $name = $_;
    foreach (
        sort {
            $data->{$name}->{results}->{$a} <=> $data->{$name}->{results}
              ->{$b}
        } keys %{ $data->{$name}->{results} }
      )
    {

        print $fh $name . ","
          . $data->{$name}->{id} . "," . "'"
          . $_ . ","
          . $data->{$name}->{results}->{$_} . "," . "\n";
    }
    print $fh $name . ","
      . $data->{$name}->{id} . "," . "," . ","
      . $data->{$name}->{total} . "\n";
}

close($fh);

This was fine and worked well (apart from reminding me why I never use Perl anymore).

Example output was like this:

Name, ID,  Label,   Count, Total
foo, 1234, ORANGE,    33,
foo, 1234, APPLE,     48,
foo, 1234, KUMQUAT,   61,
foo, 1234,     ,        ,  142
bar, 1101, BIKE,      20,
bar, 1101,     ,        ,  20

HOWEVER! I noticed I was getting key collisions (in the "results" hash) and as I need to keep and report on all of the data, I decided to try changing "results" to an array of hashes...

print Dumper $data;

___OUTPUT___
$VAR = { 'NAME' => {
                    'id' => '1234',
                    'total' => 192,
                    'results' => [
                                   { 'APPLE'   => 48 },
                                   { 'KUMQUAT' => 61 },
                                   { 'ORANGE'  => 33 },
                                   { 'APPLE'   => 50 },
                                 ]
                   }
       }
  • There are thousands of "NAME" keys.
  • There is only ever one "id" and one "total".
  • There may be one or more hashes in the "results" array.
  • Each hash in the "results" array will only ever have one name/value pair.

Whether or not anyone has even read this far, I have to say it's fairly therapeutic writing this down so I'll carry on... ;-)

For the new data structure, I'm having a problem with the sort/print code.

use strict;
use warnings;
# [...lots of other stuff...]

open(my $fh, '>', 'out.csv') or die "Cannot open out.csv: $!";
print $fh "Name, ID, Label, Count, Total\n";

foreach ( sort { $data->{$b}->{total} <=> $data->{$a}->{total} }
    keys %{$data} )
{
    my $name = $_;
    foreach (
        sort {
            $data->{$name}->{results}->{$a} <=> $data->{$name}->{results}
              ->{$b}
        } values %{ $data->{$name}->{results} }
      )
    {
    # .... HELP ME FOR THE LOVE OF ALL THAT IS GOOD IN THE WORLD! ....
    # I'm at the point now where my brain is starting to slowly dribble from my
    # ears...
    }
    print $fh $name . "," 
      . $data->{$name}->{id} . "," . "," . ","
      . $data->{$name}->{total} . "\n";
}

close($fh);

If you've read this far I salute you. If you can help, I applaud you.

If anyone has suggestions about an alternate format for the data structure, then please let me know! (In case you're interested... I'm using the "flip-flop" operator to capture blocks of the source file which I then use, line-by-line, to create the data-structure. I also call external programs to calculate certain things (no Perl equivalents) and store the results also.)

Thanks


Solution

  • Okay, I'm saying this just one time: Always use objects when you have complex structures

    As you've discovered, your brain will explode trying to track arrays of hashes of arrays of arrays of hashes. This is a perfect reason to create an object structure. It doesn't matter if you never reuse it; it makes your programming task much, much easier.

    The following package took me about 30 minutes to write and debug. If you used it, you would have saved yourself a lot of heartache and debugging.

    As a bonus, when you discovered that your assumption was mistaken (hey, everyone does it!) and that several items in your results can share the same key, you would only have had to modify a few lines of easy-to-locate code instead of going through your entire program trying to keep everything together.

    I used your data structure, except that I made RESULTS an array of two-item arrays (label and amount) instead of an array of hashes. I could have used a hash, but this way I can return an array with two items in it. Now that I think of it, there was really no reason to do this anyway.
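
    A minimal sketch of that layout, using a plain hash rather than the package below. Each entry in RESULTS is a two-element array reference, [label, amount], so a repeated label such as APPLE no longer collides. The variable names and sample values here are illustrative only.

    ```perl
    use strict;
    use warnings;

    # Illustrative data only: RESULTS holds [label, amount] pairs, so the
    # same label (APPLE) can appear more than once without colliding.
    my %foo = (
        ID      => 1234,
        TOTAL   => 192,
        RESULTS => [
            [ ORANGE  => 33 ],
            [ APPLE   => 48 ],
            [ APPLE   => 50 ],
            [ KUMQUAT => 61 ],
        ],
    );

    # Sort the pairs numerically by amount (element 1 of each pair).
    my @sorted = sort { $a->[1] <=> $b->[1] } @{ $foo{RESULTS} };

    # Flatten each pair into a "label,amount" string.
    my @lines = map { join ',', @$_ } @sorted;
    print "$_\n" for @lines;
    ```

    Because each pair is an array, sorting only ever needs to look at index 1, and the label travels along with its amount.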

    #! /usr/bin/env perl
    
    use warnings;
    use strict;
    use feature qw(say);
    use Data::Dumper;
    
    
    my %hash;
    my $obj;
    
    $obj = structure->new();
    $obj->Name("foo");
    $obj->Total("foo", 142);
    $obj->Id("foo", 1234);
    $obj->Push(qw(foo  ORANGE  33));
    $obj->Push(qw(foo  APPLE   48));
    $obj->Push(qw(foo  APPLE   50));
    $obj->Push(qw(foo  KUMQUAT 61));
    $obj->SortResults("foo");
    
    $obj->Name("bar");
    $obj->Total("bar", 20);
    $obj->Id("bar", 1100);
    $obj->Push(qw(bar BIKE    20));
    $obj->SortResults("bar");
    
    say Dumper($obj);
    exit 0;
    
    ########################################################################
    package structure;
    
    use Data::Dumper;
    
    #
    # New Structure containing all data
    # 
    sub new {
        my $class = shift;
    
        my $self = {};
    
        bless $self, $class;
        return $self;
    }
    
    #
    # Either adds a new name object or returns name object;
    #
    sub Name {
        my $self = shift;
        my $name = shift;
    
        if (not defined $self->{$name}) {
            $self->{$name}->{ID} = undef;
            $self->{$name}->{TOTAL} = undef;
            $self->{$name}->{RESULTS} = [];
        }
        return $self->{$name};
    }
    
    #
    # Returns a list of Names
    #
    sub NameList {
        my $self = shift;
    
        return keys %{$self};
    }
    #
    # Either returns the id or sets $name's id
    #
    sub Id {
        my $self = shift;
        my $name = shift;
        my $id = shift;
    
        my $nameObj = $self->Name($name);
        if (defined $id) {
            $nameObj->{ID} = $id;
        }
        return $nameObj->{ID};
    }
    
    #
    # Either returns the total for $name or sets $name's total
    #
    sub Total {
        my $self = shift;
        my $name = shift;
        my $total = shift;
    
        my $nameObj = $self->Name($name);
        if (defined $total) {
            $nameObj->{TOTAL} = $total;
        }
        return $nameObj->{TOTAL};
    }
    
    #
    # Pushes new product and amount on $name's result list
    #
    sub Push {
        my $self = shift;
        my $name = shift;
        my $product = shift;
        my $amount = shift;
    
        my $nameObj = $self->Name($name);
        my @array = ($product, $amount);
        push @{$nameObj->{RESULTS}}, \@array;
        return @array;
    }
    
    #
    # Pops the last product and amount off $name's result list
    #
    sub Pop {
        my $self = shift;
        my $name = shift;
    
        my $nameObj = $self->Name($name);
        my $arrayRef = pop @{$nameObj->{RESULTS}};
        return @{$arrayRef};
    }
    
    sub SortResults {
        my $self = shift;
        my $name = shift;
    
        my $nameObj = $self->Name($name);
        my @results = @{$nameObj->{RESULTS}};
        my @sortedResults = sort {$a->[1] <=> $b->[1]} @results;
        $nameObj->{RESULTS} = \@sortedResults;
        return @sortedResults;
    }
    

    $obj->SortResults($name) sorts the results for $name in place and also returns them as a sorted list. To sort the names by total (ascending; swap $a and $b for largest first), you could have used:

    my @sortedItems = sort {$obj->Total($a) <=> $obj->Total($b)} $obj->NameList();
    

    In short, you would have saved yourself time and the cleaning women a mess to clean up. (Exploded brains are very difficult to scrub from the walls and ceiling).

    I've learned from experience that any time you start talking about hashes of hashes that contain arrays that point to other hashes, it's time to create an object to handle the mess. It might seem to take a lot longer to create objects for these types of one-time jobs, but in my experience you can usually churn out what you need and test it in 30 minutes, which saves you hours of frustration later on.
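
For comparison, the array-of-hashes layout from the question can also be walked directly. The following is only a sketch under stated assumptions: the sample data is made up to match the question's second dump, the helper name csv_rows is hypothetical, and the sort order (total descending, then count ascending) is taken from the question's original loop.

```perl
use strict;
use warnings;

# Hypothetical sample in the question's second layout: "results" is an
# array of single-pair hashes.
my $data = {
    foo => {
        id      => '1234',
        total   => 192,
        results => [
            { APPLE  => 48 }, { KUMQUAT => 61 },
            { ORANGE => 33 }, { APPLE   => 50 },
        ],
    },
    bar => { id => '1101', total => 20, results => [ { BIKE => 20 } ] },
};

# Build the CSV rows for the whole structure (helper name is illustrative).
sub csv_rows {
    my ($data) = @_;
    my @rows = ("Name,ID,Label,Count,Total");

    # Names ordered by total, largest first, as in the question's code.
    for my $name ( sort { $data->{$b}{total} <=> $data->{$a}{total} } keys %$data ) {
        my $rec = $data->{$name};

        # Each hash holds exactly one pair, so %$_ flattens to (label, count).
        # Turn every hash into a [label, count] pair, then sort by count.
        my @pairs = sort { $a->[1] <=> $b->[1] }
                    map  { [ %$_ ] } @{ $rec->{results} };

        # One row per result, with an empty Total column.
        push @rows, join( ',', $name, $rec->{id}, @$_, '' ) for @pairs;

        # Summary row: empty Label and Count, then the total.
        push @rows, join( ',', $name, $rec->{id}, '', '', $rec->{total} );
    }
    return @rows;
}

print "$_\n" for csv_rows($data);
```

Converting each one-pair hash to a [label, count] array before sorting avoids digging the single key out of every hash inside the comparison block.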