Perl & MongoDB binary data


From the MongoDB manual:

By default, all database strings are UTF8. To save images, binaries, and other non-UTF8 data, you can pass the string as a reference to the database.

I'm fetching pages and want to store the content for later processing.

  • I can't rely on the meta charset, because many pages have UTF-8 content but wrongly declare iso-8859-1 or similar
  • so I can't use Encode (I don't know the originating charset)
  • therefore, I want to store the content simply as a stream of bytes (binary data) for later processing

Fragment of my code:

sub save {
    my ($self, $ok, $url, $fetchtime, $request ) = @_;

    my $rawhead = $request->headers_as_string;
    my $rawbody = $request->content;

    $self->db->content->insert(
        { "url" => $url, "rhead" => \$rawhead, "rbody" => \$rawbody } ) #using references here
      if $ok;

    $self->db->links->update(
        { "url" => $url },
        {
            '$set' => {
                'status'       => $request->code,
                'valid'        => $ok,
                'last_checked' => time(),
                'fetchtime'    => $fetchtime,
            }
        }
    );
}

But I get this error:

Wide character in subroutine entry at /opt/local/lib/perl5/site_perl/5.14.2/darwin-multi-2level/MongoDB/Collection.pm line 296.

This is the only place where I'm storing data.

The question: Is encoding the data (e.g. with base64) the only way to store binary data in MongoDB?


Solution

  • It looks like another sad story about the UTF8 flag...

    I may be wrong, but it seems that the headers_as_string and content methods of HTTP::Message return their strings as sequences of characters. The MongoDB driver, however, expects strings explicitly passed to it as 'binaries' to be sequences of octets, hence the "Wide character" drama.

    A rather ugly fix is to turn off the UTF8 flag on $rawhead and $rawbody in your code (I wonder whether the MongoDB driver shouldn't really do this itself), with something like this...

    use Encode qw(_utf8_off);   # internal helper, only exported on request
    _utf8_off($rawhead);
    _utf8_off($rawbody); # ugh
    

    The alternative is to use encode('utf8', $rawhead), but then you have to use decode when reading the values back from the DB, and I doubt that's any less ugly.
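For illustration, here is a minimal sketch of the encode/decode round trip described above, using only the core Encode module. The sample string is a made-up stand-in for $request->content, and the MongoDB calls are deliberately left out, so this only shows how a character string becomes a plain octet string and back. It does not claim anything about how a given driver version treats the result.

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

# Hypothetical stand-in for $request->content: a character string
# holding a code point above 255 (U+263A), which is what can trigger
# "Wide character" errors in code that expects bytes.
my $rawbody = "caf\x{e9} \x{263A}";

# Characters -> UTF-8 octets. The result has the UTF8 flag off and
# every element fits in one byte.
my $octets = encode('UTF-8', $rawbody);

# In the question's code, \$octets would be passed instead of \$rawbody.

# Octets -> characters, as would be done after reading the value back.
my $restored = decode('UTF-8', $octets);

printf "octets: %d, chars: %d, flag: %s\n",
    length($octets), length($restored),
    is_utf8($octets) ? "on" : "off";
# prints: octets: 9, chars: 6, flag: off
```

Unlike _utf8_off, which only flips an internal flag, encode produces a well-defined byte sequence. The cost is the matching decode on the way out.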