I have a large dataset that looks like this:
identifier,feature 1, feature 2, feature 3, ...
29239999, 2,5,3,...
29239999, 2,4,3,...
29239999, 2,6,7,...
17221882, 2,6,7,...
17221882, 1,1,7,...
I would like to write a script that groups these lines by identifier (so the first three and the last two would be grouped) in order to compare them. For example, among the three 29239999 rows, two have feature 3 equal to 3 and one has it equal to 7; from that group I would like to keep the row with the largest feature 2 (which would be the third 29239999 line).
My specific question: of my two options, (1) a hash or (2) making each identifier an object and then comparing the objects, which is better?
If you really are working with a "large" data set and the data is already grouped by id as in your example, then I suggest processing the lines as you go instead of building a huge hash.
use strict;
use warnings;

# Skip header row
<DATA>;

my @group;
my $lastid = '';

while (<DATA>) {
    my ($id, $data) = split /,\s*/, $_, 2;
    if ($id ne $lastid) {
        # New identifier: flush the previous group
        processData($lastid, @group);
        @group = ();
    }
    push @group, $data;
    $lastid = $id;
}
# Don't forget the final group
processData($lastid, @group);

sub processData {
    my $id = shift;
    return if !@_;
    print "$id " . scalar(@_) . "\n";
    # Rest of code here
}
__DATA__
identifier,feature 1, feature 2, feature 3, ...
29239999, 2,5,3,...
29239999, 2,4,3,...
29239999, 2,6,7,...
17221882, 2,6,7,...
17221882, 1,1,7,...
Output:
29239999 3
17221882 2
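For the asker's stated goal, the `# Rest of code here` placeholder could select the row in each group with the largest feature 2. Here is a minimal sketch, assuming feature 2 is the second remaining field after the identifier (as in the sample data):

```perl
use strict;
use warnings;

# Keep the row in each group whose feature 2 (second remaining
# field) is largest. Field position is assumed from the sample.
sub processData {
    my $id = shift;
    return if !@_;

    my ($best, $bestval);
    for my $row (@_) {
        my (undef, $f2) = split /,\s*/, $row;
        if (!defined $bestval || $f2 > $bestval) {
            ($best, $bestval) = ($row, $f2);
        }
    }
    print "$id: $best";   # $best keeps its trailing newline
    return $best;
}

processData('29239999', "2,5,3\n", "2,4,3\n", "2,6,7\n");
# prints "29239999: 2,6,7"
```

This picks the third 29239999 row (feature 2 of 6 beats 5 and 4), matching the example in the question.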
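If the lines were *not* already sorted by identifier, the hash option from the question would be the natural fallback: a hash of arrays keyed by identifier, built in one pass, at the cost of holding everything in memory. A rough sketch (the `@lines` array stands in for the real file; the field layout is assumed from the sample):

```perl
use strict;
use warnings;

# Group rows by identifier with a hash of arrays.
my @lines = (
    "29239999, 2,5,3",
    "17221882, 2,6,7",
    "29239999, 2,4,3",
);

my %rows;
for my $line (@lines) {
    my ($id, $data) = split /,\s*/, $line, 2;
    push @{ $rows{$id} }, $data;   # append to this id's group
}

# Each group could now be handed to a routine like processData
for my $id (keys %rows) {
    print "$id has " . scalar(@{ $rows{$id} }) . " row(s)\n";
}
```

Note that `keys %rows` returns identifiers in no particular order, which is usually fine when each group is processed independently.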