I'm using Perl (5.8.8, don't ask) and I'm looking at a serialised binary file that I want to parse and snaffle information from.
The format is as follows:
My current code somewhat naively skips the first 8 bytes, then reads byte by byte until it hits a null, and then does very specific parsing.
sub readGroupsFile {
    my %index;

    open(my $fh, "<:raw", "groupsfile");
    seek($fh, 8, 0);    # skip the 8-byte header

    # Each record starts with a 7-byte user ID.
    while (read($fh, my $userID, 7)) {
        $index{$userID} = ();
        seek($fh, 18, 1);    # skip ahead 18 bytes to the group list

        # Read the group list byte by byte until the terminating null.
        my $groups = "";
        while (read($fh, my $byte, 1)) {
            last if (ord($byte) == 0);
            $groups .= $byte;
        }

        my @grouplist = split("\n", $groups);
        $index{$userID} = \@grouplist;
    }

    close($fh);
    return \%index;
}
Good news? It works.
However, I think it's not very elegant, and wonder if I can use the 2-byte number that specifies the amount of items to follow to my advantage to speed up the parsing. I have no idea why else it would be there.
I think unpack() and its templates may provide an answer, but I can't figure out how it can work with variable-length arrays of strings that each have their own variable lengths.
Since you have no idea how much to read for each record, reading in the whole file at once will give you the best results, speed-wise.
{
    # Slurp the whole file (named on the command line, or STDIN) at once.
    my $file = do { local $/; <> };

    # Strip the 8-byte header.
    $file =~ s/^.{8}//s
        or die("Bad data");

    # Peel one record off the front of the string at a time:
    # user ID, two fields we skip, then the newline-separated list,
    # each field null-terminated.
    while (length($file)) {
        $file =~ s/^([^\0]*)\0[^\0]*\0[^\0]*\0([^\0]*)\0//
            or die("Bad data");

        my $user_id = $1;
        my @items   = split(/\n/, $2, -1);

        ...
    }
}
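If it helps to see it wired into your existing interface, here is a minimal sketch that wraps the slurp-and-regex approach in a sub shaped like your readGroupsFile (the name readGroupsFileSlurp and the error handling are mine, and it assumes the same record layout the regex above implies):

# Hypothetical drop-in replacement for readGroupsFile: slurps the file,
# then parses it with the regex above. Assumes an 8-byte header followed
# by records of null-terminated fields.
sub readGroupsFileSlurp {
    my ($path) = @_;
    my %index;

    open(my $fh, "<:raw", $path)
        or die("Can't open $path: $!");
    my $file = do { local $/; <$fh> };    # slurp the whole file
    close($fh);

    # Strip the 8-byte header.
    $file =~ s/^.{8}//s
        or die("Bad data");

    # Consume one record at a time from the front of the string.
    while (length($file)) {
        $file =~ s/^([^\0]*)\0[^\0]*\0[^\0]*\0([^\0]*)\0//
            or die("Bad data");
        my ($user_id, $groups) = ($1, $2);
        $index{$user_id} = [ split(/\n/, $groups, -1) ];
    }

    return \%index;
}

Called as my $index = readGroupsFileSlurp("groupsfile"), it should return the same user-to-group-list hashref your current sub does.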
By using a buffer, you can get most of the benefit of reading the whole file at once without actually holding the whole file in memory, but it does make the code more complicated.
{
    my $buf     = '';
    my $not_eof = 1;

    # Refill the buffer in 1 MB chunks; once EOF is reached, $not_eof
    # stays false and read() is no longer called.
    my $reader = sub {
        $not_eof &&= read(\*ARGV, $buf, 1024*1024, length($buf));
        die($!) if !defined($not_eof);
        return $not_eof;
    };

    # Strip the 8-byte header, refilling until we have enough bytes.
    while ($buf !~ s/^.{8}//s) {
        $reader->()
            or die("Bad data");
    }

    while (length($buf) || $reader->()) {
        my $user_id;
        my @items;

        # Keep refilling until a complete record is in the buffer.
        while (1) {
            if ($buf =~ s/^([^\0]*)\0[^\0]*\0[^\0]*\0([^\0]*)\0//) {
                $user_id = $1;
                @items   = split(/\n/, $2, -1);
                last;
            }

            $reader->()
                or die("Bad data");
        }

        ...
    }
}
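The ... in the buffered version plays the same role as in the first snippet: that is where you would record $user_id and @items, for example into the same %index hash your original sub builds. The 1024*1024 is just the refill chunk size, so the buffer only ever holds roughly one chunk plus whatever remains of a partially-read record rather than the whole file; you can tune it if your records are unusually large.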