I have a large dataset that looks like this:
identifier,feature 1, feature 2, feature 3, ...
29239999, 2,5,3,...
29239999, 2,4,3,...
29239999, 2,6,7,...
17221882, 2,6,7,...
17221882, 1,1,7,...
I would like to write a script that groups these lines by identifier (so the first three and the last two would be grouped) in order to compare them. For example, among the three 29239999 rows, two have feature 3 equal to 3 and one has it equal to 7; from that group I would like to keep the row with the largest feature 2 (which would be the third 29239999 line).
My specific question: of my two options, (1) a hash or (2) making each identifier an object and then comparing the objects, which is better?
If you really are working with a "large" data set and the data is already grouped by id as in your example, then I suggest processing the lines as you go instead of building a huge hash.
use strict;
use warnings;

# Skip header row
<DATA>;

my @group;
my $lastid = '';

while (<DATA>) {
    my ($id, $data) = split /,\s*/, $_, 2;
    if ($id ne $lastid) {
        # New identifier: flush the previous group
        processData($lastid, @group);
        @group = ();
    }
    push @group, $data;
    $lastid = $id;
}
# Don't forget the final group
processData($lastid, @group);

sub processData {
    my $id = shift;
    return if !@_;
    print "$id " . scalar(@_) . "\n";
    # Rest of code here
}
__DATA__
identifier,feature 1, feature 2, feature 3, ...
29239999, 2,5,3,...
29239999, 2,4,3,...
29239999, 2,6,7,...
17221882, 2,6,7,...
17221882, 1,1,7,...
Output:
29239999 3
17221882 2
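For the asker's stated goal, the `# Rest of code here` placeholder could select the row in each group with the largest feature 2. Here is a minimal sketch, assuming feature 2 is the second remaining field after the identifier (as in the sample data):

```perl
use strict;
use warnings;

# Keep the row in each group whose feature 2 (second remaining
# field) is largest. Field position is assumed from the sample.
sub processData {
    my $id = shift;
    return if !@_;

    my ($best, $bestval);
    for my $row (@_) {
        my (undef, $f2) = split /,\s*/, $row;
        if (!defined $bestval || $f2 > $bestval) {
            ($best, $bestval) = ($row, $f2);
        }
    }
    print "$id: $best";   # $best keeps its trailing newline
    return $best;
}

processData('29239999', "2,5,3\n", "2,4,3\n", "2,6,7\n");
# prints "29239999: 2,6,7"
```

This picks the third 29239999 row (feature 2 of 6 beats 5 and 4), matching the example in the question.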
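If the lines were *not* already sorted by identifier, the hash option from the question would be the natural fallback: a hash of arrays keyed by identifier, built in one pass, at the cost of holding everything in memory. A rough sketch (the `@lines` array stands in for the real file; the field layout is assumed from the sample):

```perl
use strict;
use warnings;

# Group rows by identifier with a hash of arrays.
my @lines = (
    "29239999, 2,5,3",
    "17221882, 2,6,7",
    "29239999, 2,4,3",
);

my %rows;
for my $line (@lines) {
    my ($id, $data) = split /,\s*/, $line, 2;
    push @{ $rows{$id} }, $data;   # append to this id's group
}

# Each group could now be handed to a routine like processData
for my $id (keys %rows) {
    print "$id has " . scalar(@{ $rows{$id} }) . " row(s)\n";
}
```

Note that `keys %rows` returns identifiers in no particular order, which is usually fine when each group is processed independently.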