I have a list of 1.6 million lines that looks like this:
N123HN /var/foo/bar/baz/A/Alpha.file.1234.bin
N123HN /var/foo/bar/baz/A/Alpha.file.1235.bin
N123KL /var/foo/bar/baz/A/Alpha.file.1236.bin
I have a Perl script that basically just greps this data on the second column, as a way of looking up the value in the first column (then it does other magic with the "N123HN" value, etc.). As it is now, my app spends about 4 minutes ingesting the file and loading it into a huge hash (key/value array). While the grep-like functions themselves are slow for obvious reasons, the slowest part of running this script is this huge ingest of data each time it runs.
Anyone have any clever ideas how to access this data more quickly? Since it is just a list of two columns, a relational database seems pretty heavyweight for this use case.
I'm re-editing the original question here since pasting source code into the comments boxes is pretty ugly.
The algorithm I'm using to ingest the huge file is this:
while (<HUGEFILE>) {
    # hugefile format:
    # nln N123HN ---- 1 0 1c44f5.4a6ee12 17671854355 /var/foo/bar/baz/A/Alpha.file.1234.bin 0
    next if /^\s*$/;                      # skip blank lines
    chomp;                                # remove the trailing newline
    my @auditrows = split;                # fields of the current row, split on whitespace
    my $file_url = $auditrows[7];         # /var/foo/bar/baz/A/Alpha.file.1234.bin
    my $tapenum = "$auditrows[1] ";       # N123HN
    $tapenumbers{ $file_url } = $tapenum; # key = file path, value = tape number
}
Have you tried using a hash with the second column as the key and the first column as the value? Then you can iterate over the 200 or so file paths and look each one up in the hash directly. This is probably going to be a lot faster than using the grep function. Here's a quick script that would load the data:
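#!/usr/bin/perl
use strict;
use warnings;

my %data;   # path => list of first-column values seen for that path
open(my $fh, '<', 'data') or die "cannot open data: $!";
while (<$fh>) {
    my ($k, $path) = split;
    push @{ $data{$path} }, $k;   # a path may appear on more than one line
}
close($fh);
# On Perl older than 5.26, scalar(%data) prints used/allocated buckets, not the key count.
print "loaded data: ", scalar(%data), "\n";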
My Perl is pretty rusty, but this runs quickly on my laptop with a 1.6 million line input file: about 5.5 seconds of wall-clock time, with a peak resident set of about 489 MB.
pa-mac-w80475xjagw% head -5 data
N274YQ /var/foo/bar/baz/GODEBSVT/Alpha.file.9824.bin
N602IX /var/foo/bar/baz/UISACEXK/Alpha.file.5675.bin
N116CH /var/foo/bar/baz/GKUQAYWF/Alpha.file.7146.bin
N620AK /var/foo/bar/baz/DHYRCLUD/Alpha.file.2130.bin
N716YD /var/foo/bar/baz/NYMSJLHU/Alpha.file.2343.bin
pa-mac-w80475xjagw% wc -l data
1600000 data
pa-mac-w80475xjagw% /usr/bin/time -l ./parse.pl
loaded data: 1118898/2097152
5.54 real 5.18 user 0.36 sys
488919040 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
119627 page reclaims
1 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
30 involuntary context switches
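The script above only loads the data, so here is a minimal sketch of the lookup side, assuming the same two-column format. The helper names load_index and lookup, the in-memory sample, and the @wanted list are all made up for illustration; in your script @wanted would be the 200 or so file paths you need to resolve.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a path => [tape numbers] hash from lines of "TAPE PATH".
# load_index() is a hypothetical helper name for this sketch.
sub load_index {
    my ($fh) = @_;
    my %data;
    while (my $line = <$fh>) {
        my ($tape, $path) = split ' ', $line;
        next unless defined $path;       # skip blank or short lines
        push @{ $data{$path} }, $tape;
    }
    return \%data;
}

# Return the tape numbers recorded for one path (empty list if unknown).
# lookup() is also a hypothetical helper name.
sub lookup {
    my ($index, $path) = @_;
    return @{ $index->{$path} // [] };
}

# In-memory sample so the sketch runs without the real 1.6 million line file.
my $sample = <<'END';
N123HN /var/foo/bar/baz/A/Alpha.file.1234.bin
N123HN /var/foo/bar/baz/A/Alpha.file.1235.bin
N123KL /var/foo/bar/baz/A/Alpha.file.1236.bin
END
open(my $fh, '<', \$sample) or die "cannot open in-memory sample: $!";
my $index = load_index($fh);
close($fh);

# Illustrative list of paths to resolve; the second one is deliberately absent.
my @wanted = (
    '/var/foo/bar/baz/A/Alpha.file.1236.bin',
    '/var/foo/bar/baz/A/Alpha.file.9999.bin',
);
for my $path (@wanted) {
    my @tapes = lookup($index, $path);
    print "$path => ", (@tapes ? join(',', @tapes) : 'not found'), "\n";
}
```

Each lookup is a single hash access rather than a scan over every line, so resolving a few hundred paths costs almost nothing once the hash is built. The load is still paid on every run.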