
PDF Copy Text Issue: Weird Characters


I tried to copy text from a PDF file but got some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe cannot. All three applications are installed on Windows 10 64-bit. To better explain my issue, here is a video: https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards


Solution

  • In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.

    Mapping character codes to Unicode as described in the PDF specification

    The PDF specification ISO 32000-1 (and similarly ISO 32000-2) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.

    It has been quoted very often in other Stack Overflow answers, so I won't quote it here again.

    Essentially, this is the algorithm used by Adobe Acrobat during copy and paste, and also by many other text extractors.

    In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:

    If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
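
    As a rough illustration of the lookup order described in the specification, here is a minimal Python sketch. The function name `code_to_unicode`, its parameters, and the tiny lookup tables are all hypothetical and exist only for this example. Real extractors work with the full Adobe Glyph List and complete CMap resources, and they differ in how they treat a ToUnicode map that lacks an entry. The sketch simply falls through to the next step in that case, which is an assumption.

```python
from typing import Dict, Optional

# Tiny illustrative excerpts (assumptions for this sketch); real tables are far larger.
WIN_ANSI_NAMES: Dict[int, str] = {0x41: "A", 0x80: "Euro", 0xE9: "eacute"}  # code -> glyph name
GLYPH_LIST: Dict[str, str] = {"A": "A", "Euro": "\u20ac", "eacute": "\u00e9"}  # glyph name -> Unicode


def code_to_unicode(
    code: int,
    to_unicode: Optional[Dict[int, str]] = None,
    encoding_names: Optional[Dict[int, str]] = None,
    cid_to_unicode: Optional[Dict[int, str]] = None,
) -> Optional[str]:
    """Return the Unicode text for a character code, or None if every step fails."""
    # Step 1: the font's own ToUnicode CMap, if it has one.
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]

    # Step 2: simple font with a known encoding:
    # character code -> glyph name -> Unicode (via a glyph list).
    if encoding_names is not None:
        name = encoding_names.get(code)
        if name in GLYPH_LIST:
            return GLYPH_LIST[name]

    # Step 3: composite font with a predefined CMap or known character collection:
    # character code -> CID -> Unicode (collapsed into one table here).
    if cid_to_unicode is not None and code in cid_to_unicode:
        return cid_to_unicode[code]

    # Nothing matched: the PDF itself does not say what this code means.
    return None


print(code_to_unicode(0xE9, encoding_names=WIN_ANSI_NAMES))  # resolved through the glyph name
print(code_to_unicode(0x05))                                 # no information at all
```

    A `None` result corresponds to the point quoted above: the PDF gives no answer, so whatever a viewer shows on copy and paste comes from its own guesswork.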

    What happens if the algorithm above fails to produce a Unicode value

    This is where text extraction implementations differ: they try to determine the matching Unicode value by using heuristics, by using information from beyond the PDF, or by applying OCR to the glyph in question.

    That the different programs you tried returned such different results shows that

    1. your PDF does not contain the information required for the algorithm above from the PDF specification and

    2. the heuristics used by those programs differ significantly, and Okular's heuristics work best for your document.

    What to do in such a case

    There are multiple options, more or less feasible depending on your concrete case:

    1. Ask the source of the PDF for a version that contains proper information for text extraction.

      Unless you have a contract with that source that requires them to supply the PDFs in a machine readable form or the source is otherwise obligated to do so, they usually will decline, though...

    2. Apply OCR to the PDF in question.

      Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...

    3. You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".

      Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...
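
    To give an idea of what such a manually created ToUnicode map looks like, here is a minimal Python sketch that generates the CMap text from a code-to-text table. The function name `build_tounicode_cmap` and the sample mapping are hypothetical. The actual codes depend on each font in your PDF and have to be worked out by inspecting its glyphs. Embedding the resulting stream and referencing it from the font dictionary's ToUnicode entry (for example with PDFBox, as in the linked answer) is not shown.

```python
from typing import Dict


def build_tounicode_cmap(mapping: Dict[int, str], code_bytes: int = 2) -> str:
    """Build ToUnicode CMap text from {character code: Unicode text}."""

    def hex_code(code: int) -> str:
        # Character code as fixed-width hex, e.g. 1 -> "0001" for 2-byte codes.
        return f"{code:0{code_bytes * 2}X}"

    def hex_text(text: str) -> str:
        # Target text as UTF-16BE hex, e.g. "fi" -> "00660069".
        return text.encode("utf-16-be").hex().upper()

    entries = sorted(mapping.items())
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        f"<{'00' * code_bytes}> <{'FF' * code_bytes}>",
        "endcodespacerange",
    ]
    # A single bfchar section may hold at most 100 entries, so split into chunks.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            lines.append(f"<{hex_code(code)}> <{hex_text(text)}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical mapping: glyph codes 1..3 of some font stand for "C", "h" and the ligature "fi".
print(build_tounicode_cmap({1: "C", 2: "h", 3: "fi"}))
```

    One such table is needed per font, which is where the time and effort mentioned above comes from.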