pdf ms-word doc

PDF to Word conversion shows garbage contents


I have a public PDF with no copying restrictions. However, when I try to copy text from the PDF into Word, I only get unreadable gibberish (boxes and symbols).

I tried changing the fonts, but that didn't work either. I cannot understand what is causing this issue.

I also searched for some online tools, but none of those seem to work either.

Any help or ideas?

Cheers.


Solution

  • If every (online) tool you try is unable to process this document, there are two options:

    1. Every tool is wrong
    2. Your document is wrong

    I think option 2 is the more plausible one. But allow me to explain what the likely culprit is.

    First off, you should think of PDF documents as containers of drawing instructions, rather than WYSIWYG documents. So extracting text is already a non-trivial thing.

    But the issue here seems to be a problem of encoding. Your document contains instructions like "draw glyph ب at position 10, 50". (I've used Arabic text as an example.) Such an instruction identifies a shape in a font, not a Unicode character.

    Without any further information, it is very difficult for a viewer (like Adobe Reader) to handle copy-paste. The clipboard on your system doesn't hold glyphs, it holds Unicode text.

    In other words, when copy-pasting, the viewer has to attempt to convert glyphs into actual Unicode characters.

    Usually, the font information in the PDF includes something that helps. We call this a 'ToUnicode' map. It tells the viewer which Unicode characters each glyph corresponds to.

    If your font does not contain this kind of mapping, using that font will prevent you from properly copy-pasting. Then again, there are also programs that produce faulty ToUnicode maps on purpose (as a way of preventing copy-pasting from that document).
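
To make the role of the ToUnicode map concrete, here is a minimal Python sketch. It is not how any particular viewer works. The CMap fragment, the glyph codes, and the helper names `parse_bfchar` and `extract_text` are all made up for illustration, and the "no map" fallback (treating each glyph code as a code point) is just one assumed guess a tool might make.

```python
import re

# Hypothetical ToUnicode CMap fragment, in the general shape found in PDF fonts.
# The glyph codes and their targets are invented for this example.
TOUNICODE_CMAP = """
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0003> <0628>
<0011> <0627>
<0024> <0644>
endbfchar
endcmap
"""


def parse_bfchar(cmap_text):
    """Return {glyph_code: unicode_string} from the bfchar sections of a CMap."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is hex-encoded UTF-16BE text.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def extract_text(glyph_codes, to_unicode=None):
    """Turn glyph codes into text, with or without a ToUnicode map."""
    if to_unicode is None:
        # Assumed fallback: pretend each glyph code is a Unicode code point.
        return "".join(chr(code) for code in glyph_codes)
    # Unknown glyphs become U+FFFD (the replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in glyph_codes)


# Glyph codes as they might appear in the page's drawing instructions.
glyph_codes = [0x0003, 0x0011, 0x0003]
mapping = parse_bfchar(TOUNICODE_CMAP)

print(repr(extract_text(glyph_codes)))   # control characters, i.e. gibberish
print(extract_text(glyph_codes, mapping))  # the intended Arabic letters
```

With the map, the three glyph codes come out as U+0628 U+0627 U+0628. Without it, the same codes turn into unprintable control characters, which is the kind of garbage you see in Word. A deliberately scrambled map would produce readable but wrong characters in the same way.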