
perl: utf8 <something> does not map to Unicode while <something> doesn't seem to be present


I'm using MARC::Lint to lint some MARC records, but every now and then I get an error (on about 1% of the files):

utf8 "\xCA" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212.

The problem is that I've tried different methods but cannot find "\xCA" in the file...

My script is:

#!perl -w
use MARC::File::USMARC;
use MARC::Lint;
use utf8;

use open OUT => ':utf8';

my $lint = MARC::Lint->new();
my $filename = shift;

my $file = MARC::File::USMARC->in( $filename );
while ( my $marc = $file->next() ) {
    $lint->check_record( $marc );
    # Print the errors that were found
    print join( "\n", $lint->warnings ), "\n";
} # while

and the file can be downloaded here: http://eroux.fr/I14376.mrc

Is "\xCA" hidden somewhere? Or is this a bug in MARC::Lint?


Solution

  • The problem has nothing to do with MARC::Lint. Remove the lint check, and you'll still get the error.

    The problem is a bad data file.

    The file contains a "directory" that records where each field's data is located in the file. The following is a human-readable rendition of the directory of the file you provided:

    tagno|offset|len   # Offsets are from the start of the data portion.
    001|00000|0017     # Lengths include the single-byte field terminator.
    006|00017|0019     # Offsets and lengths are in bytes.
    007|00036|0015
    008|00051|0041
    035|00092|0021
    035|00113|0021
    040|00134|0018
    050|00152|0022
    066|00174|0009
    245|00183|0101
    246|00284|0135
    264|00419|0086
    300|00505|0034
    336|00539|0026
    337|00565|0026
    338|00591|0036
    546|00627|0016
    500|00643|0112
    505|00755|9999  <--
    506|29349|0051
    520|29400|0087
    533|29487|0115
    542|29602|0070
    588|29672|0070
    653|29742|0013
    710|29755|0038
    720|29793|0130
    776|29923|0066
    856|29989|0061
    880|30050|0181
    880|30231|0262
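
    For reference, a rendition like this can be produced without MARC::Record at all. The following is a minimal sketch that unpacks the directory by hand, assuming the file holds a single well-formed record: the leader is the first 24 bytes, leader bytes 12-16 give the base address of the data, and each 12-byte directory entry is a 3-digit tag, a 4-digit length, and a 5-digit offset.

    #!perl -w
    use strict;

    open my $fh, '<:raw', shift or die $!;
    my $raw = do { local $/; <$fh> };         # slurp the whole record

    my $base = substr $raw, 12, 5;            # leader bytes 12-16: base address of data
    my $dir  = substr $raw, 24, $base - 25;   # directory, minus its field terminator

    print "tagno|offset|len\n";
    while ( $dir =~ /\G(\d{3})(\d{4})(\d{5})/g ) {
        printf "%s|%s|%s\n", $1, $3, $2;      # tag, offset, length
    }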
    

    Notice the length of the field with tag 505: 9999. This is the maximum value supported, because the length is stored as four decimal digits. The catch is that the value of that field is far larger than 9,999 bytes; it's actually 28,594 bytes long.

    What happens is that the module extracts 9,999 bytes rather than 28,594. This happens to cut a UTF-8 sequence in half. (The specific sequence is CA BC, the encoding of ʼ, U+02BC.) Later, when the module attempts to decode that text, an error is thrown. (CA is a lead byte, so it must be followed by a continuation byte to be valid.)
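
    You can reproduce the fatal decode in isolation. Here is a minimal sketch using Encode's FB_CROAK fallback (whether MARC::File::USMARC uses exactly this check is an assumption on my part, but the failure has the same shape):

    use Encode qw( decode FB_CROAK );

    my $whole = decode( 'UTF-8', "\xCA\xBC", FB_CROAK );   # ok: ʼ (U+02BC)
    my $cut   = decode( 'UTF-8', "\xCA",     FB_CROAK );   # dies: utf8 "\xCA" does not map to Unicode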

    Are these records you are generating? If so, you need to make sure that no field requires more than 9,999 bytes.
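
    If so, a pre-flight check along these lines can flag oversized fields before serializing. This is only a sketch: usmarc_field_len is a hypothetical helper that approximates a field's serialized size (indicators, subfield delimiters, data, and the field terminator), and $marc is assumed to be a MARC::Record already in scope.

    use Encode qw( encode );

    # Hypothetical helper: approximates a field's serialized (USMARC) size.
    sub usmarc_field_len {
        my $field = shift;
        my $s;
        if ( $field->is_control_field ) {
            $s = $field->data;
        }
        else {
            $s = $field->indicator(1) . $field->indicator(2);
            $s .= "\x1F$_->[0]$_->[1]" for $field->subfields;
        }
        return 1 + length( encode( 'UTF-8', $s ) );   # +1 for the \x1E terminator
    }

    for my $field ( $marc->fields ) {
        my $len = usmarc_field_len($field);
        warn sprintf "field %s is %d bytes, over the 9,999-byte limit\n",
            $field->tag, $len
            if $len > 9999;
    }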

    Still, the module should handle this better. When it finds no end-of-field marker where it expects one, it could read until the next end-of-field marker instead of trusting the directory's length, and/or it could handle decoding errors in a non-fatal manner. It already has a mechanism for reporting such problems ($marc->warnings).

    In fact, if it hadn't died (say, if the cut had happened to fall between characters instead of in the middle of one), $marc->warnings would have returned the following message:

    field does not end in end of field character in tag 505 in record 1
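
    Until the module does handle this more gracefully, you can keep one bad record from killing the whole run yourself. Here is a hedged variation of the question's loop that traps the fatal decode per record and also surfaces record-level warnings:

    while (1) {
        my $marc = eval { $file->next() };
        if ($@) {
            warn "skipping undecodable record: $@";
            next;   # the raw record was already consumed, so this advances
        }
        last unless $marc;
        print join( "\n", $marc->warnings ), "\n" if $marc->warnings;
        $lint->check_record( $marc );
        print join( "\n", $lint->warnings ), "\n" if $lint->warnings;
    }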