python pdf pdfminer pdf-parsing

pdfminer pdf2txt outputs 'FF'


[Screenshot: output of pdf2txt.py]

I have a PDF. After installing pdfminer.six in my Windows 10, Python 3.6 environment, I ran:

$ pdf2txt.py -o test1 download.pdf

This gave me the output shown in the screenshot. When I run:

$ dumppdf.py -o test2 download.pdf

I get:

<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="47" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">+&#13;N&#158;&#213;&#233;&#197;&#176;&#8;&#207;&#15;&#60;&#133;M&#140;&#4;</string>
<string size="16">&#34;&#179;&#255;&#28;&#221;&#234;&#177;&#39;&#166;&#133;&#15;&#214;&#237;&#25;&#196;&#205;</string>
</list></value>
<key>Root</key>
<value><ref id="46" /></value>
<key>Size</key>
<value><number>48</number></value>
</dict>
</trailer>

What do I do next? How can I get this working?


Solution

  • The reason pdfminer cannot extract any usable text from the document in question is that the document does not contain text!

    More precisely, that Worksheet PDF does not contain text drawing instructions, merely graphics drawing instructions (the results of which look like text). PDF text extractors (like pdfminer), on the other hand, inspect only the text drawing instructions, so they return nothing.

    To mine data from such documents, therefore, you should use OCR rather than text extraction.
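
    As a rough illustration of the OCR route, here is a minimal sketch. It assumes the third-party packages pdf2image and pytesseract (neither is part of pdfminer, and they in turn need Poppler and Tesseract installed on the machine). The function name ocr_pdf and its render/recognize parameters are hypothetical, chosen only for this example.

```python
def ocr_pdf(pdf_path, dpi=300, lang="eng", render=None, recognize=None):
    """Rasterize each page of a PDF and run OCR on the resulting images.

    render and recognize are hypothetical hooks: by default they use
    pdf2image and pytesseract, but any callables with the same shape work.
    """
    if render is None:
        # Assumption: pdf2image is installed and Poppler is on the PATH.
        from pdf2image import convert_from_path

        def render(path):
            return convert_from_path(path, dpi=dpi)

    if recognize is None:
        # Assumption: pytesseract is installed and can find Tesseract.
        import pytesseract

        def recognize(image):
            return pytesseract.image_to_string(image, lang=lang)

    page_images = render(pdf_path)
    # Separate pages with a form feed, as pdf2txt.py does in its text output.
    return "\f".join(recognize(image) for image in page_images)
```

    Calling ocr_pdf("download.pdf") would then return the recognized text of all pages as one string. Recognition quality depends heavily on the rendering resolution and on the fonts in the document, so the dpi and lang values above are only starting points.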


    In a comment you asked:

    how do you know that only graphic instructions are contained? What tools do you use?

    You need a PDF browser application and some knowledge of PDF internals.

    As a PDF browser I usually use iText RUPS or the PDFBox PDF Debugger. There are other good browsers, too; for example, one is included in Adobe Preflight.

    Using such a PDF browser you can inspect the content streams of the PDF which contain the instructions for drawing the pages. And in your case these content streams do not contain any text drawing instructions, merely graphics drawing ones.
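
    If you would rather do a quick check from Python than open a PDF browser, the same inspection can be approximated with pdfminer.six itself. The sketch below is a heuristic, not a full content stream parser: it decodes each page's content streams and looks for the text-showing operators Tj and TJ. The function names are hypothetical, it ignores the less common ' and " operators, it does not follow form XObjects, and a string literal that happens to contain " Tj " would produce a false positive.

```python
import re

# Tj / TJ as a standalone operator: preceded by whitespace or by the end of
# its operand (")", "]" or ">"), and followed by whitespace or end of data.
_TEXT_SHOW = re.compile(rb"(?:^|(?<=[\s)\]>]))(?:Tj|TJ)(?=\s|$)")


def has_text_operators(content: bytes) -> bool:
    """Return True if a decoded content stream seems to show text."""
    return _TEXT_SHOW.search(content) is not None


def pages_without_text(pdf_path):
    """Return 1-based numbers of pages whose content streams show no text."""
    # Imported here so has_text_operators() works without pdfminer.six.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import stream_value

    result = []
    with open(pdf_path, "rb") as fh:
        document = PDFDocument(PDFParser(fh))
        for number, page in enumerate(PDFPage.create_pages(document), start=1):
            # page.contents holds the page's content streams (possibly
            # indirect references); get_data() applies the stream filters.
            data = b"\n".join(stream_value(s).get_data() for s in page.contents)
            if not has_text_operators(data):
                result.append(number)
    return result
```

    For a document like the one in the question, pages_without_text("download.pdf") would be expected to list every page, which is a hint that OCR is needed. A PDF browser remains the more reliable way to confirm it.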

    Knowledge of PDF internals can be gained by studying the PDF specification ISO 32000-2 (the older precursor specification ISO 32000-1 is a good starting point, too, if the newer spec is not at hand) and by analyzing many real-world PDFs against it.