
perl: utf8 <something> does not map to Unicode while <something> doesn't seem to be present


I'm using MARC::Lint to lint some MARC records, but every now and then I get an error (on about 1% of the files):

utf8 "\xCA" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212.

The problem is that I've tried different methods but cannot find "\xCA" in the file...

My script is:

#!perl -w
use MARC::File::USMARC;
use MARC::Lint;
use utf8;

use open OUT => ':utf8';

my $lint = MARC::Lint->new();
my $filename = shift;

my $file = MARC::File::USMARC->in( $filename );
while ( my $marc = $file->next() ) {
    $lint->check_record( $marc );
    # Print the errors that were found
    print join( "\n", $lint->warnings ), "\n";
} # while

and the file can be downloaded here: http://eroux.fr/I14376.mrc

Is "\xCA" hidden somewhere? Or is this a bug in MARC::Lint?


Solution

  • The problem has nothing to do with MARC::Lint. Remove the lint check, and you'll still get the error.

    The problem is a bad data file.

    The file contains a "directory" that records where each field's data is located in the file. The following is a human-readable rendition of the directory of the file you provided:

    tagno|offset|len   # Offsets are from the start of the data portion.
    001|00000|0017     # Lengths include the single-byte field terminator.
    006|00017|0019     # Offsets and lengths are in bytes.
    007|00036|0015
    008|00051|0041
    035|00092|0021
    035|00113|0021
    040|00134|0018
    050|00152|0022
    066|00174|0009
    245|00183|0101
    246|00284|0135
    264|00419|0086
    300|00505|0034
    336|00539|0026
    337|00565|0026
    338|00591|0036
    546|00627|0016
    500|00643|0112
    505|00755|9999  <--
    506|29349|0051
    520|29400|0087
    533|29487|0115
    542|29602|0070
    588|29672|0070
    653|29742|0013
    710|29755|0038
    720|29793|0130
    776|29923|0066
    856|29989|0061
    880|30050|0181
    880|30231|0262
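
    For reference, a rendition like this can be produced without MARC::Record at all. The following is a minimal sketch that unpacks the directory by hand, assuming the file holds a single well-formed record: the leader is the first 24 bytes, leader bytes 12-16 give the base address of the data, and each 12-byte directory entry is a 3-digit tag, a 4-digit length, and a 5-digit offset.

    #!perl -w
    use strict;

    open my $fh, '<:raw', shift or die $!;
    my $raw = do { local $/; <$fh> };         # slurp the whole record

    my $base = substr $raw, 12, 5;            # leader bytes 12-16: base address of data
    my $dir  = substr $raw, 24, $base - 25;   # directory, minus its field terminator

    print "tagno|offset|len\n";
    while ( $dir =~ /\G(\d{3})(\d{4})(\d{5})/g ) {
        printf "%s|%s|%s\n", $1, $3, $2;      # tag, offset, length
    }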
    

    Notice the length of the field with tag 505: 9999. This is the maximum value supported, because the length is stored as four decimal digits. The catch is that the value of that field is far larger than 9,999 bytes; it's actually 28,594 bytes long.

    What happens is that the module extracts 9,999 bytes rather than 28,594. This happens to cut a UTF-8 sequence in half. (The specific sequence is CA BC, the encoding of ʼ, U+02BC.) Later, when the module attempts to decode that text, an error is thrown. (CA is a lead byte, so it must be followed by a continuation byte to be valid.)
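
    You can reproduce the fatal decode in isolation. Here is a minimal sketch using Encode's FB_CROAK fallback (whether MARC::File::USMARC uses exactly this check is an assumption on my part, but the failure has the same shape):

    use Encode qw( decode FB_CROAK );

    my $whole = decode( 'UTF-8', "\xCA\xBC", FB_CROAK );   # ok: ʼ (U+02BC)
    my $cut   = decode( 'UTF-8', "\xCA",     FB_CROAK );   # dies: utf8 "\xCA" does not map to Unicode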

    Are these records you are generating? If so, you need to make sure that no field requires more than 9,999 bytes.
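
    If so, a pre-flight check along these lines can flag oversized fields before serializing. This is only a sketch: usmarc_field_len is a hypothetical helper that approximates a field's serialized size (indicators, subfield delimiters, data, and the field terminator), and $marc is assumed to be a MARC::Record already in scope.

    use Encode qw( encode );

    # Hypothetical helper: approximates a field's serialized (USMARC) size.
    sub usmarc_field_len {
        my $field = shift;
        my $s;
        if ( $field->is_control_field ) {
            $s = $field->data;
        }
        else {
            $s = $field->indicator(1) . $field->indicator(2);
            $s .= "\x1F$_->[0]$_->[1]" for $field->subfields;
        }
        return 1 + length( encode( 'UTF-8', $s ) );   # +1 for the \x1E terminator
    }

    for my $field ( $marc->fields ) {
        my $len = usmarc_field_len($field);
        warn sprintf "field %s is %d bytes, over the 9,999-byte limit\n",
            $field->tag, $len
            if $len > 9999;
    }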

    Still, the module should handle this better. When it finds no end-of-field marker where it expects one, it could read until the next end-of-field marker instead of trusting the directory's length, and/or it could handle decoding errors in a non-fatal manner. It already has a mechanism for reporting such problems ($marc->warnings).

    In fact, if it hadn't died (say, if the cut had happened to fall between characters instead of in the middle of one), $marc->warnings would have returned the following message:

    field does not end in end of field character in tag 505 in record 1
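
    Until the module does handle this more gracefully, you can keep one bad record from killing the whole run yourself. Here is a hedged variation of the question's loop that traps the fatal decode per record and also surfaces record-level warnings:

    while (1) {
        my $marc = eval { $file->next() };
        if ($@) {
            warn "skipping undecodable record: $@";
            next;   # the raw record was already consumed, so this advances
        }
        last unless $marc;
        print join( "\n", $marc->warnings ), "\n" if $marc->warnings;
        $lint->check_record( $marc );
        print join( "\n", $lint->warnings ), "\n" if $lint->warnings;
    }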