I use PDF::API2 in my Perl application to embed OCR output behind the corresponding image, allowing the resulting PDF to be searched, as the OCR output can be extracted with pdftotext.
At the moment, as soon as the application sees a non-ASCII character in the OCR output, it switches from PDF core fonts to TTF. However, this is really hacky, as the core fonts include most Western European characters. TTF is only necessary for Greek, Russian, Japanese, etc.
How can I tell whether a particular font includes a particular character (including the CMAP table so that extraction with pdftotext works)?
Have you tried the glyph-specific methods?
http://search.cpan.org/dist/PDF-API2/lib/PDF/API2/Resource/BaseFont.pm#GLYPH_RELATED_METHODS
Failing that, perhaps rendering the glyph (to a separate document) and measuring it?
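If inspecting the font itself proves awkward, a coarser test than "any non-ASCII character" is to ask whether the text fits the single-byte encoding the core font is created with. The sketch below uses only the core Encode module and does not inspect the font at all. The helper name fits_encoding is hypothetical, and the default of cp1252 (WinAnsi) is an assumption; substitute whatever encoding you actually pass when creating the core font, and check that the font really carries glyphs for that whole set.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if every character of $text can be represented
# in the given single-byte encoding (assumed default: cp1252, i.e. WinAnsi).
sub fits_encoding {
    my ($text, $encoding) = @_;
    $encoding //= 'cp1252';
    # FB_CROAK makes encode() die on the first unmappable character;
    # LEAVE_SRC stops encode() from modifying $text in place.
    return eval { encode($encoding, $text, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Western European text stays on the core font; Greek falls through to TTF.
for my $sample ("na\x{EF}ve caf\x{E9}", "\x{395}\x{3BB}\x{3BB}\x{3B7}\x{3BD}\x{3B9}\x{3BA}\x{3AC}") {
    print fits_encoding($sample) ? "core font\n" : "TTF\n";
}
```

This only shows that the encoding can represent the text, not that pdftotext will get it back out, so it is worth round-tripping a test page through pdftotext for each font you rely on.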