pdfbox, apache-tika

How to further process a buggy / malformed PDF that cannot be parsed by Tika / PDFBox but can be opened by Evince / LibreOffice Draw?


My program is reading documents with Tika 2.24 to extract their contents.

Yet some PDFs (maybe buggy or malformed) cannot be processed by PDFBox, although Evince, LibreOffice Draw or even GIMP can open them.

I cannot share these PDFs, but what I can say is that they used to trigger a StackOverflowError (as described on Jira) with PDFBox 2.0.25 and now trigger an IOException with PDFBox 2.0.26:

Caused by: java.io.IOException: Possible recursion detected when dereferencing object 29 0

Consequently, now that the failure surfaces as an IOException that can be caught, it is tempting to catch it and process the malformed PDF in a different way than the first parsing attempt.
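
A minimal sketch of that catch-and-retry idea, using only the JDK. The class `FallbackExtraction` and the `TextExtractor` interface are hypothetical names introduced for illustration; they are not part of Tika or PDFBox. The sketch assumes the recursion failure may arrive wrapped in another exception (the trace above shows it as a "Caused by"), so it walks the cause chain and matches on the message text quoted above.

```java
import java.io.IOException;
import java.nio.file.Path;

/** Hypothetical helper; the names here are illustrative, not a Tika or PDFBox API. */
final class FallbackExtraction {

    /** Minimal abstraction so the sketch does not depend on any particular library. */
    @FunctionalInterface
    interface TextExtractor {
        String extract(Path pdf) throws Exception;
    }

    /** Returns true if t, or any of its causes, is the IOException about possible recursion. */
    static boolean isRecursionFailure(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (c instanceof IOException && msg != null
                    && msg.contains("Possible recursion detected")) {
                return true;
            }
        }
        return false;
    }

    /** Tries the primary extractor; uses the fallback only for the recursion failure. */
    static String extract(Path pdf, TextExtractor primary, TextExtractor fallback)
            throws Exception {
        try {
            return primary.extract(pdf);
        } catch (Exception e) {
            if (!isRecursionFailure(e)) {
                throw e; // unrelated failure: do not hide it behind the fallback
            }
            return fallback.extract(pdf);
        }
    }
}
```

The primary extractor would wrap the existing Tika call and the fallback whatever alternative is available. Matching on the message text is fragile and may need adjusting if the wording changes between PDFBox versions.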

I read that PDFBox offers a way to handle malformed PDFs by calling setLenient(true) on a parser, but I could not find a way to set such leniency in Tika.

I also tried that approach with both setLenient(true) and setLenient(false), but the IOException still appears.

Edit: following KJ's suggestion, I ran pdftotext, which output the following warnings:

Syntax Error (5602): Object '29 0 obj' is being already parsed
Syntax Error (5603): Bad 'Length' attribute in stream
Syntax Error (8596): Missing 'endstream' or incorrect stream length
Syntax Error (16945): Object '35 0 obj' is being already parsed
Syntax Error (16946): Bad 'Length' attribute in stream
Syntax Error (23267): Missing 'endstream' or incorrect stream length
Syntax Error (23332): Object '37 0 obj' is being already parsed
Syntax Error (23333): Bad 'Length' attribute in stream
Syntax Error (28645): Missing 'endstream' or incorrect stream length

(Please note: four pages seem to be malformed, as PDFsam cannot export them separately.)

Opening the PDF file in a text editor, as suggested by KJ, revealed only a single hit for "29 0 obj". Running mutool show -be mypdf.pdf 29 prints the warning "PDF stream Length incorrect" and then the compressed content.

QPDF check: still following KJ's advice, running qpdf with the --check flag yields:

checking myPDFWithIssues.pdf
PDF Version: 1.5
File is not encrypted
File is not linearized
WARNING: myPDFWithIssues.pdf (offset 5602): loop detected resolving object 29 0
WARNING: myPDFWithIssues.pdf (object 29 0, offset 5552): /Length key in stream dictionary is not an integer
WARNING: myPDFWithIssues.pdf (object 29 0, offset 5603): attempting to recover stream length
WARNING: myPDFWithIssues.pdf (object 29 0, offset 5603): recovered stream length: 2983
WARNING: myPDFWithIssues.pdf (offset 16945): loop detected resolving object 35 0
WARNING: myPDFWithIssues.pdf (object 35 0, offset 16895): /Length key in stream dictionary is not an integer
WARNING: myPDFWithIssues.pdf (object 35 0, offset 16946): attempting to recover stream length
WARNING: myPDFWithIssues.pdf (object 35 0, offset 16946): recovered stream length: 6311
WARNING: myPDFWithIssues.pdf (offset 23332): loop detected resolving object 37 0
WARNING: myPDFWithIssues.pdf (object 37 0, offset 23282): /Length key in stream dictionary is not an integer
WARNING: myPDFWithIssues.pdf (object 37 0, offset 23333): attempting to recover stream length
WARNING: myPDFWithIssues.pdf (object 37 0, offset 23333): recovered stream length: 5302

Yet the faulty PDF has been regenerated by another user (from the same sources), and the newer PDF does not show any warnings. So the issue will be hard to track down!

So my question is: how can I process, with Tika / PDFBox, malformed PDFs that trigger the aforementioned IOException related to possible recursion?

Any hint is appreciated.


Solution

  • The quick-and-dirty approach I used was to call the external command-line tool pdftotext (from the poppler-utils package on Debian / Ubuntu), as suggested by @KJ in their (now deleted) comments.
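
For illustration, calling pdftotext from Java could look roughly like the sketch below. It is not the exact code I used: the class name `PdfToTextFallback` is hypothetical, and it assumes Java 9+ and that pdftotext is on the PATH. Passing "-" as the output file makes pdftotext write the text to standard output, -enc UTF-8 sets the output encoding, and -q suppresses the syntax-error messages.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical wrapper around the pdftotext command-line tool (poppler-utils). */
final class PdfToTextFallback {

    /** Builds the command line: quiet mode, UTF-8 output, text written to stdout ("-"). */
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-q", "-enc", "UTF-8", pdf.toString(), "-");
    }

    /** Runs pdftotext and returns the extracted text; assumes the tool is installed. */
    static String extract(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                // Discard stderr so an unread pipe cannot block the child process.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String text;
        try (InputStream in = process.getInputStream()) {
            text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with status " + exit + " for " + pdf);
        }
        return text;
    }
}
```

This would only be invoked after Tika / PDFBox has failed with the recursion IOException, so well-formed PDFs keep going through the normal parsing path.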