Tags: database, perl, bigdata, performance-testing

What is the fastest way, in Perl, to access a 1.6 million row list of key/value pairs?


I have a list of 1.6 million lines that looks like this:

N123HN  /var/foo/bar/baz/A/Alpha.file.1234.bin
N123HN  /var/foo/bar/baz/A/Alpha.file.1235.bin
N123KL  /var/foo/bar/baz/A/Alpha.file.1236.bin

I have a Perl script that basically just greps this data on the second column as a way of looking up the value in the first column (it then does other magic with the "N123HN" value, etc.). As it stands, my app spends about 4 minutes ingesting the file and loading it into a huge hash (Perl's key/value structure). The grep-like lookups themselves are slow for obvious reasons, but the slowest part of running this script is the huge ingest of data it has to repeat on every run.

Does anyone have any clever ideas for accessing this data more quickly? Since it is just a two-column list, a relational database seems pretty heavyweight for this use case.
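One middle ground between re-reading a flat file and a full relational database might be a DBM file: Perl's tie interface keeps the hash on disk, so lookups work without re-ingesting all 1.6 million pairs on every run. Below is a rough sketch, assuming the DB_File module (and Berkeley DB) is available; the file name is made up for illustration:

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Hypothetical disk-backed hash: build it once, then reuse it across runs.
    tie my %tapenumbers, 'DB_File', 'tapenumbers.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "cannot tie tapenumbers.db: $!";

    # Lookups go straight to the on-disk hash, with no 4-minute ingest:
    my $tape = $tapenumbers{'/var/foo/bar/baz/A/Alpha.file.1234.bin'};
    untie %tapenumbers;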

I'm re-editing the original question here, since pasting source code into the comment boxes is pretty ugly.

The algorithm I'm using to ingest the huge file is this:

while (<HUGEFILE>) {
    # hugefile format:
    # nln N123HN ---- 1 0 1c44f5.4a6ee12 17671854355 /var/foo/bar/baz/A/Alpha.file.1234.bin 0

    next if /^\s*$/;        # skip blank lines
    chomp;                  # remove trailing newline characters
    my @auditrows = split;  # the fields of one row, split on whitespace
    my $file_url = $auditrows[7];          # /var/foo/bar/baz/A/Alpha.file.1234.bin
    my $tapenum  = $auditrows[1];          # N123HN
    $tapenumbers{ $file_url } = $tapenum;  # key   = "/var/foo/bar/baz/A/Alpha.file.1234.bin"
}                                          # value = "N123HN"
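
One way to avoid paying this ingest cost on every run would be to cache the parsed hash on disk and rebuild it only when the source file changes. Here is a rough sketch using the core Storable module (the 'hugefile' path and cache file name are made up for illustration, not from the original script):

    use strict;
    use warnings;
    use Storable qw(store retrieve);

    my $source = 'hugefile';              # hypothetical path to the audit file
    my $cache  = 'tapenumbers.storable';  # hypothetical cache file

    my %tapenumbers;
    if (-e $cache && -M $cache < -M $source) {
        # The cache is newer than the source file: skip the slow parse.
        %tapenumbers = %{ retrieve($cache) };
    }
    else {
        open(my $fh, '<', $source) or die "cannot open $source: $!";
        while (<$fh>) {
            next if /^\s*$/;
            my @auditrows = split;
            $tapenumbers{ $auditrows[7] } = $auditrows[1];
        }
        close $fh;
        store(\%tapenumbers, $cache);     # pay the parse cost only once
    }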

Solution

  • Have you tried using a hash with the second column as the key and the first column as the value? Then you can iterate over the 200 or so file paths and look them up in the hash directly (see the lookup sketch after the timing output below). This is probably going to be a lot faster than using the grep function. Here's a quick script that loads the data:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %data;
    open(my $fh, '<', 'data') or die "cannot open data: $!";
    while (<$fh>) {
        my ($k, $path) = split;
        push @{$data{$path}}, $k;   # a path can map to more than one key
    }
    # Note: on older perls, scalar(%data) reports used/allocated hash buckets.
    print "loaded data: ", scalar(%data), "\n";
    

    My Perl is pretty rusty, but this runs really quickly on my laptop with a 1.6-million-line input file.

    pa-mac-w80475xjagw% head -5 data
    N274YQ  /var/foo/bar/baz/GODEBSVT/Alpha.file.9824.bin
    N602IX  /var/foo/bar/baz/UISACEXK/Alpha.file.5675.bin
    N116CH  /var/foo/bar/baz/GKUQAYWF/Alpha.file.7146.bin
    N620AK  /var/foo/bar/baz/DHYRCLUD/Alpha.file.2130.bin
    N716YD  /var/foo/bar/baz/NYMSJLHU/Alpha.file.2343.bin
    pa-mac-w80475xjagw% wc -l data
     1600000 data
    pa-mac-w80475xjagw% /usr/bin/time -l ./parse.pl
    loaded data: 1118898/2097152
            5.54 real         5.18 user         0.36 sys
     488919040  maximum resident set size
             0  average shared memory size
             0  average unshared data size
             0  average unshared stack size
        119627  page reclaims
             1  page faults
             0  swaps
             0  block input operations
             0  block output operations
             0  messages sent
             0  messages received
             0  signals received
             0  voluntary context switches
            30  involuntary context switches
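
    For the lookup side, continuing from the script above, probing the hash for each of the 200 or so paths might look like the sketch below (@paths_of_interest is a hypothetical list of the paths being resolved):

    my @paths_of_interest = ('/var/foo/bar/baz/GODEBSVT/Alpha.file.9824.bin');  # hypothetical
    for my $path (@paths_of_interest) {
        if (my $tapes = $data{$path}) {
            # Each probe is a constant-time hash lookup instead of a scan
            # over all 1.6 million lines.
            print "$path => @$tapes\n";
        }
        else {
            warn "no entry for $path\n";
        }
    }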