When I copy text from a PDF document that has some font subsets embedded and paste it into an MS Word document, the result is illegible.
Several symbols are changed or even disappear.
Using Adobe Acrobat I can check which specific fonts are embedded.
You should first check your PDF document's fonts with the help of the pdffonts utility. It is part of the XPDF package for Windows and can be used without installation, straight from a DOS box.
In order to successfully extract text (or copy'n'paste it) from a PDF, the font should either use a standard encoding (not a Custom one) or have a /ToUnicode table associated with it inside the PDF.
pdffonts returns some basic information about the fonts used by your PDF.
Example output:
    $ pdffonts -f 3 -l 5 sample.pdf
    name                      type          encoding     emb sub uni object ID
    ------------------------- ------------- ------------ --- --- --- ---------
    IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes      10 0
    SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no       11 0
The command above asked for the fonts used in the page range 3 (first page to check) to 5 (last page to check).
In the above case, both fonts are embedded as subsets (indicated by the XYZABC+ prefixes to their names, as well as by the yes in the emb and sub columns).
The font SSKFGJ+ArialMT uses a custom encoding, and the PDF has no /ToUnicode table for this font, as indicated by the no entry in the column headed uni.
Hence it is not easy to extract text that is shown with this font (extraction would require manual reverse engineering -- but then you can also just "read" the PDF pages).
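If a PDF uses many fonts, reading the pdffonts table by eye gets tedious. The following is a minimal Python sketch of the check described above: it parses pdffonts output and lists the fonts that combine a Custom encoding with a missing /ToUnicode table. The names FontInfo, parse_pdffonts and fonts_hard_to_extract are made up for this illustration, and the parser assumes that font names and encoding names contain no spaces.

```python
from typing import List, NamedTuple


class FontInfo(NamedTuple):
    name: str
    type: str
    encoding: str
    embedded: bool
    subset: bool
    has_tounicode: bool


def parse_pdffonts(output: str) -> List[FontInfo]:
    """Parse the table printed by pdffonts into FontInfo records.

    Assumption: font names and encoding names contain no spaces, so each
    row can be split on whitespace and read from the right-hand side.
    """
    fonts = []
    for line in output.splitlines():
        tokens = line.split()
        # A data row has: name, 1+ type words, encoding, emb, sub, uni, obj, gen
        if len(tokens) < 8:
            continue
        flags = tokens[-5:-2]
        object_id = tokens[-2:]
        # Skips the header, the dashed separator and any echoed command line.
        if not all(f in ("yes", "no") for f in flags):
            continue
        if not all(part.isdigit() for part in object_id):
            continue
        fonts.append(FontInfo(
            name=tokens[0],
            type=" ".join(tokens[1:-6]),
            encoding=tokens[-6],
            embedded=flags[0] == "yes",
            subset=flags[1] == "yes",
            has_tounicode=flags[2] == "yes",
        ))
    return fonts


def fonts_hard_to_extract(fonts: List[FontInfo]) -> List[str]:
    """Names of fonts with a Custom encoding and no /ToUnicode table."""
    return [f.name for f in fonts
            if f.encoding == "Custom" and not f.has_tounicode]


SAMPLE = """\
name type encoding emb sub uni object ID
------------------------- ------------- ------------ --- --- --- ---------
IADKRB+Arial-BoldMT CID TrueType Identity-H yes yes yes 10 0
SSKFGJ+ArialMT CID TrueType Custom yes yes no 11 0
"""

for name in fonts_hard_to_extract(parse_pdffonts(SAMPLE)):
    print(name)  # prints: SSKFGJ+ArialMT
```

To use it on a real file, save the output of pdffonts to a text file and pass its contents to parse_pdffonts instead of SAMPLE.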
You should first check whether copy'n'pasting the text works when the target is a simple text file (not an MS Word document). If it doesn't, you can already forget about MS Word...
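The plain-text check can also be done without the clipboard. The sketch below assumes that pdftotext, which ships in the same XPDF package as pdffonts, is on your PATH; it extracts the same page range to plain text so you can see whether the result is legible. The helper names pdftotext_command and extract_text are invented for this example.

```python
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, first_page: int, last_page: int) -> List[str]:
    """Build a pdftotext call that writes UTF-8 text to stdout ("-")."""
    return [
        "pdftotext",
        "-f", str(first_page),   # first page to convert
        "-l", str(last_page),    # last page to convert
        "-enc", "UTF-8",         # output text encoding
        pdf_path,
        "-",                     # "-" sends the text to stdout instead of a file
    ]


def extract_text(pdf_path: str, first_page: int, last_page: int) -> str:
    """Run pdftotext (assumed to be installed) and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Example (requires pdftotext and a real file, so it is left commented out):
# print(extract_text("sample.pdf", 3, 5))
```

If the printed text is already garbled here, the problem lies in the PDF's font encoding, and pasting into MS Word cannot do any better.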
- Would installing such fonts in Microsoft Word work it out?
- If so, where can I get or even create those subsets of the fonts I need?
- If not, how could I solve this problem?
Unfortunately, you cannot get exactly the same info about the fonts used by a PDF via Acrobat or Adobe Reader. What you can get via Menu -> File -> Properties... is the list of fonts on the Fonts tab, but you do not get the info about the presence of a /ToUnicode table.