ios pdf text nsstring ligature

Read special character bytes from PDF to unichar or NSString


First off, the solution in "Convert or Print CGPDFStringRef string" doesn't work for ligatures.

I'm reading text from a PDF and trying to convert it to an NSString. I can get a byte array of text using Apple's CGPDFScanner in the form of a CGPDFString. The "fi" ligature character is giving me trouble. When I look at my byte array in the debugger I see a '\f'.

So, for simplicity's sake, let's say that I have this char:

unsigned char myLigatureFromPDF = '\f';

Ultimately I'd like to convert it to this (the Unicode value for the "fi" ligature):

unichar whatIWant = 0xFB01;

This is my failed attempt (I copied this from PDFKitten btw):

const char str[] = {myLigatureFromPDF, '\0'};
NSString* stringEncodedLigature = [NSString stringWithCString:str encoding:NSUTF8StringEncoding];
unichar encodedLigature = [stringEncodedLigature characterAtIndex:0];

If anyone can tell me how to do this, that would be great. Also, as a side note, how does the debugger interpret the unencoded byte array? In other words, when I hover over the array, how does it know to show a '\f'?

Thanks!


Solution

  • Every PDF parser is limited in its capabilities by one important point of the PDF specification: characters in literal strings are encoded as bytes or words, but the mapping from those codes to actual characters does not need to be included in the file.

    For example, if a subset of a font is included where the code "1" corresponds to the image (character glyph) of an "h" and the code "2" maps to a glyph "a", the string (\1\2\1\2) will show "haha", as expected. But if the PDF contains no further information on how the glyphs in that font correspond to Unicode, there is no way for a string decoder to find out the correct character codes for "glyph #1" and "glyph #2".

    It seems your test PDF does contain that information -- else, how could it infer the correct characters for "regular" characters? -- but in this case, the "regular" characters were simply not remapped to other binary codes, for convenience. Also, again for convenience, the glyph for the single character "fi" was remapped to "0x0C" in the original font (or in the subset that got included into your file). But, again, if the file does not contain a translation table between character codes and Unicode values, there is no way to retrieve the correct code.

    The above is true for all PDFs and strings. If the font definition in the PDF contains an encoding, your string extraction method should use it; if the PDF contains a /ToUnicode table for the font, again, your method should use it. If it contains neither, you get the literal string contents (and, presumably, you are not informed which method was used and how reliable it is).

    As a final footnote: in TeX and LaTeX fonts, ligatures are mapped to lower ASCII codes (as well as a smattering of other non-ASCII codes, such as the curly quotes). It seems you are reading a PDF that was created through TeX here -- but that can only be inferred from this particular encoding. Also, even if you know in advance that the PDF was generated through TeX, it's not guaranteed that it does use this particular encoding, as the decision to translate or not translate is at the discretion of the PDF generator, not TeX itself.
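The lookup order described above (use the font's own mapping when there is one, otherwise fall back to the literal byte) can be sketched in plain C, which also compiles inside an Objective-C file. This is only an illustration under stated assumptions: `CodeTable`, `fill_ot1_ligatures`, and `decode_code` are hypothetical names, not part of CGPDFScanner or PDFKitten, and the ligature positions 0x0B to 0x0F are those of TeX's classic OT1 text encoding, which may or may not be what a given PDF uses. The byte 0x0C is also the ASCII form feed, which is why a debugger prints it with the C escape '\f'.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-font table: map[i] is the Unicode value for character
   code i, or 0 when no mapping is known for that code. In a real parser
   it would be filled from the font's /Encoding or /ToUnicode data. */
typedef struct {
    uint16_t map[256];
} CodeTable;

/* Assumption: the font follows TeX's OT1 layout, where codes 0x0B..0x0F
   hold the ligatures ff, fi, fl, ffi, ffl (U+FB00..U+FB04). */
static void fill_ot1_ligatures(CodeTable *table) {
    static const uint16_t ligatures[5] = {
        0xFB00, 0xFB01, 0xFB02, 0xFB03, 0xFB04
    };
    for (size_t i = 0; i < 5; i++) {
        table->map[0x0B + i] = ligatures[i];
    }
}

/* Translate one byte from a PDF string. Uses the table when it has an
   entry; otherwise returns the byte unchanged, which is only correct if
   the font happens to keep plain ASCII codes. */
static uint16_t decode_code(const CodeTable *table, uint8_t code) {
    if (table != NULL && table->map[code] != 0) {
        return table->map[code];
    }
    return code;
}
```

With a zero-initialized table that has been passed through `fill_ot1_ligatures`, `decode_code(&table, '\f')` returns 0xFB01, which can be assigned directly to a `unichar`. Without a table the same call returns 0x0C, mirroring the "literal string contents" case.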