Tags: perl, hash, dataset, large-files

Compare lines in a file


I have a large dataset that looks like this:

identifier,feature 1, feature 2, feature 3, ...
29239999, 2,5,3,...
29239999, 2,4,3,...
29239999, 2,6,7,...
17221882, 2,6,7,...
17221882, 1,1,7,...

I would like to write a script that groups these lines by identifier (so the first three and the last two would be grouped) in order to compare them. For example, for 29239999 I would get three lines: two with feature 3 equal to 3 and one with feature 3 equal to 7. In particular, I would like to keep the line with the largest feature 2, which for 29239999 is the third line.

My specific question: of my two options, (1) using a hash and (2) making each identifier an object and then comparing them, which is better?
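
For concreteness, option (1) would amount to a hash of arrays keyed by identifier. This is only a minimal sketch, assuming the data arrives on STDIN and the identifier is always the first comma-separated field:

    use strict;
    use warnings;

    my %rows;

    <STDIN>;    # skip the header row

    while (my $line = <STDIN>) {
        my ($id, $rest) = split /,\s*/, $line, 2;
        push @{ $rows{$id} }, $rest;    # group lines by identifier
    }

    # %rows now maps each identifier to an array ref of its feature strings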


Solution

  • If you really are working with a "large" data set and the data is already grouped by id as in your example, then I suggest processing the groups as you go instead of building a huge hash. (A sketch of the per-group selection step follows after the output below.)

    use strict;
    use warnings;
    
    # Skip the header row
    <DATA>;
    
    my @group;
    my $lastid = '';
    
    while (<DATA>) {
        # Split off the identifier; $data holds the rest of the line
        my ($id, $data) = split /,\s*/, $_, 2;

        # A new identifier means the previous group is complete
        if ($id ne $lastid) {
            processData($lastid, @group);
            @group = ();
        }

        push @group, $data;
        $lastid = $id;
    }
    
    processData($lastid, @group);
    
    sub processData {
        my $id = shift;

        # Nothing to do for the very first (empty) group
        return if !@_;

        # For now, just report the id and how many lines are in its group
        print "$id " . scalar(@_) . "\n";
    
        # Rest of code here
    }
    
    __DATA__
    identifier,feature 1, feature 2, feature 3, ...
    29239999, 2,5,3,...
    29239999, 2,4,3,...
    29239999, 2,6,7,...
    17221882, 2,6,7,...
    17221882, 1,1,7,...
    

    Outputs

    29239999 3
    17221882 2
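
    If the goal for each group is to keep the line with the largest feature 2, processData could be filled in along these lines. This is a minimal sketch, assuming feature 2 is the second comma-separated field after the identifier and is always numeric:

        sub processData {
            my $id = shift;
            return if !@_;

            # Sort the group's lines by feature 2 (second field of each
            # data string), largest first, and keep the top one.
            my ($best) = sort {
                (split /,\s*/, $b)[1] <=> (split /,\s*/, $a)[1]
            } @_;

            print "$id: $best";
        }

    Sorting is the simplest way to express "largest feature 2", but it does more work than strictly needed; a single pass over @_ that tracks the current maximum would do just as well for very large groups.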