I am trying to write a Python app that gives me a word count for PDFs.
I've run into something odd with this PDF though.
When I extract the text from the PDF, it shows up as some sort of binary/symbol garbage.
I have tried the PyPDF2 and PyMuPDF libraries, with the same result.
How can I get a word count on PDFs like this one?
Here is the file: https://www.dropbox.com/s/hdgqd70l0kcayvo/mhr.pdf?dl=0
That PDF is missing the information necessary for text extraction. Thus, an attempt to extract text from it usually outputs garbage.
The text in that PDF is drawn using a font which neither exposes a ToUnicode map nor an encoding with standardized names. It also does not mark the content with ActualText properties. Furthermore, a naive identity mapping of character codes to e.g. Latin-1 does not result in anything intelligible either.
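To check this kind of thing yourself, you can look at the font dictionaries a page refers to. The following is only a minimal sketch, assuming a PyPDF2 version that provides `PdfReader`; the helper names `describe_font` and `list_page_fonts` are made up for illustration, and the sketch only looks at fonts listed directly in each page's resources (not at fonts used inside form XObjects).

```python
from typing import Any, Dict, List, Mapping


def describe_font(font: Mapping[str, Any]) -> Dict[str, Any]:
    """Summarise the font dictionary entries that matter for text extraction."""
    subtype = str(font["/Subtype"]) if "/Subtype" in font else None
    encoding = font["/Encoding"] if "/Encoding" in font else None
    return {
        "subtype": subtype,
        # Type 3 fonts define their glyphs with PDF drawing operators.
        "is_type3": subtype == "/Type3",
        # Without a ToUnicode map, extractors must fall back to the encoding.
        "has_tounicode": "/ToUnicode" in font,
        # Either a predefined name (e.g. /WinAnsiEncoding) or a dictionary
        # whose /Differences array may contain non-standard glyph names.
        "encoding": None if encoding is None else str(encoding),
    }


def list_page_fonts(pdf_path: str) -> List[Dict[str, Any]]:
    """Collect a describe_font() summary for every font in the page resources."""
    # Third-party dependency, imported lazily (assumption: PyPDF2 >= 2.0).
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    report = []
    for page_number, page in enumerate(reader.pages, start=1):
        if "/Resources" not in page:
            continue
        resources = page["/Resources"]
        if "/Font" not in resources:
            continue
        fonts = resources["/Font"]
        for name in fonts:
            info = describe_font(fonts[name])
            info.update(page=page_number, name=str(name))
            report.append(info)
    return report
```

If `list_page_fonts("mhr.pdf")` reports fonts with `has_tounicode` set to `False` and an encoding that consists only of custom glyph names, ordinary text extraction is unlikely to give usable results.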
Thus, text extraction according to the algorithm proposed in the PDF specification ISO 32000 (parts 1 and 2) will, for each character, end up at the stage:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
(ISO 32000-1, section 9.10.2 Mapping Character Codes to Unicode Values)
You can see that Adobe Acrobat cannot make sense of the text either by trying to copy and paste from the document.
In some such situations, though, diving deeper into an embedded font will turn up alternative mappings to Unicode, and some text extractors do use them.
Nonetheless, this approach won't help here either: the font is a Type 3 font, i.e. it is not based on some normal font format (e.g. TrueType) but is instead completely defined using PDF vector graphics sequences, without any further mappings to Unicode.
Thus, without some degree of OCR (human or automated) there is no way to extract the text from this PDF.
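For a word count app, one possible approach is to try normal extraction first and fall back to OCR when the result looks unusable. The following is a rough sketch, not a tested solution for this particular file: it assumes PyMuPDF (a version whose `get_pixmap` accepts `dpi`), `pytesseract` with a locally installed Tesseract, and Pillow. The function names and the 0.7 threshold are arbitrary choices for illustration, and the garbage check is a crude heuristic that can misjudge some inputs.

```python
import io
import re


def count_words(text: str) -> int:
    """Count word-like tokens, keeping internal apostrophes and hyphens."""
    return len(re.findall(r"\w+(?:['’-]\w+)*", text))


def looks_like_garbage(text: str, min_readable_ratio: float = 0.7) -> bool:
    """Crude heuristic: too few letters/whitespace means extraction failed."""
    stripped = text.strip()
    if not stripped:
        return True
    readable = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return readable / len(stripped) < min_readable_ratio


def ocr_pdf(pdf_path: str, dpi: int = 300) -> str:
    """Render each page to an image and run Tesseract OCR on it."""
    # Third-party dependencies, imported lazily (assumed to be installed).
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            pages.append(pytesseract.image_to_string(image))
    return "\n".join(pages)


def word_count(pdf_path: str) -> int:
    """Word count via normal extraction, with OCR as a fallback."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    if looks_like_garbage(text):
        text = ocr_pdf(pdf_path)
    return count_words(text)
```

A call like `word_count("mhr.pdf")` would then only give an approximate count, since OCR accuracy depends on the rendering resolution and on the fonts and layout of the document.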
If this document is indeed published in its current form by some U.S. department (and is not the output of some conversion tool applied to their original document), you might want to contact that department and discuss topics like accessibility and Section 508...