I'm using Perl (5.8.8, don't ask) and I'm looking at a serialised binary file that I want to parse and snaffle information from.
The format is as follows:
My current code somewhat naively skips the first 8 bytes, then reads byte by byte until it hits a null, and then does very specific parsing.
sub readGroupsFile {
    my %index;

    open(my $fh, "<:raw", "groupsfile");
    seek($fh, 8, 0);    # skip the 8-byte header

    # Each record starts with a 7-byte user ID.
    while (read($fh, my $userID, 7)) {
        $index{$userID} = ();
        seek($fh, 18, 1);    # skip ahead 18 bytes to the group list

        # Read the group list byte by byte until the terminating null.
        my $groups = "";
        while (read($fh, my $byte, 1)) {
            last if (ord($byte) == 0);
            $groups .= $byte;
        }

        my @grouplist = split("\n", $groups);
        $index{$userID} = \@grouplist;
    }

    close($fh);
    return \%index;
}
Good news? It works.
However, I think it's not very elegant, and wonder if I can use the 2-byte number that specifies the amount of items to follow to my advantage to speed up the parsing. I have no idea why else it would be there.
I think unpack() and its templates may provide an answer, but I can't figure out how it can work with variable-length arrays of strings that each have their own variable lengths.
Since you have no idea how much to read for each record, reading in the whole file at once will give you the best results, speed-wise.
{
    # Slurp the whole file (named on the command line, or STDIN) at once.
    my $file = do { local $/; <> };

    # Strip the 8-byte header.
    $file =~ s/^.{8}//s
        or die("Bad data");

    # Peel one record off the front of the string at a time:
    # user ID, two fields we skip, then the newline-separated list,
    # each field null-terminated.
    while (length($file)) {
        $file =~ s/^([^\0]*)\0[^\0]*\0[^\0]*\0([^\0]*)\0//
            or die("Bad data");

        my $user_id = $1;
        my @items   = split(/\n/, $2, -1);

        ...
    }
}
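If it helps to see it wired into your existing interface, here is a minimal sketch that wraps the slurp-and-regex approach in a sub shaped like your readGroupsFile (the name readGroupsFileSlurp and the error handling are mine, and it assumes the same record layout the regex above implies):

# Hypothetical drop-in replacement for readGroupsFile: slurps the file,
# then parses it with the regex above. Assumes an 8-byte header followed
# by records of null-terminated fields.
sub readGroupsFileSlurp {
    my ($path) = @_;
    my %index;

    open(my $fh, "<:raw", $path)
        or die("Can't open $path: $!");
    my $file = do { local $/; <$fh> };    # slurp the whole file
    close($fh);

    # Strip the 8-byte header.
    $file =~ s/^.{8}//s
        or die("Bad data");

    # Consume one record at a time from the front of the string.
    while (length($file)) {
        $file =~ s/^([^\0]*)\0[^\0]*\0[^\0]*\0([^\0]*)\0//
            or die("Bad data");
        my ($user_id, $groups) = ($1, $2);
        $index{$user_id} = [ split(/\n/, $groups, -1) ];
    }

    return \%index;
}

Called as my $index = readGroupsFileSlurp("groupsfile"), it should return the same user-to-group-list hashref your current sub does.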
By using a buffer, you can get most of the benefit of reading the whole file at once without actually holding the whole file in memory, but it does make the code more complicated.
{
    my $buf     = '';
    my $not_eof = 1;

    # Refill the buffer in 1 MB chunks; once EOF is reached, $not_eof
    # stays false and read() is no longer called.
    my $reader = sub {
        $not_eof &&= read(\*ARGV, $buf, 1024*1024, length($buf));
        die($!) if !defined($not_eof);
        return $not_eof;
    };

    # Strip the 8-byte header, refilling until we have enough bytes.
    while ($buf !~ s/^.{8}//s) {
        $reader->()
            or die("Bad data");
    }

    while (length($buf) || $reader->()) {
        my $user_id;
        my @items;

        # Keep refilling until a complete record is in the buffer.
        while (1) {
            if ($buf =~ s/^([^\0]*)\0[^\0]*\0[^\0]*\0([^\0]*)\0//) {
                $user_id = $1;
                @items   = split(/\n/, $2, -1);
                last;
            }

            $reader->()
                or die("Bad data");
        }

        ...
    }
}
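The ... in the buffered version plays the same role as in the first snippet: that is where you would record $user_id and @items, for example into the same %index hash your original sub builds. The 1024*1024 is just the refill chunk size, so the buffer only ever holds roughly one chunk plus whatever remains of a partially-read record rather than the whole file; you can tune it if your records are unusually large.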