html, pdf, apache-tika

Tika isn't reading PDF properly


I am using Tika to convert PDF files to HTML, and the output is not as expected. The document is 8 pages long, but only 2 pages are being read, and they are repeated in the output. For example, it outputs pages 2, 2, 2, 3, 3, 3, 3, 2. The metadata also outputs:

pdf:charsPerPage: 1791
pdf:charsPerPage: 1791
pdf:charsPerPage: 1791
pdf:charsPerPage: 5672
pdf:charsPerPage: 5672
pdf:charsPerPage: 5672
pdf:charsPerPage: 5672
pdf:charsPerPage: 1791

What could be happening here? The file in question is publicly available here: Phantom_3_Standard_Quick_Start_Guide_en_201509.pdf


Solution

  • The reason for this surprising text extraction result is that the content streams of pages 1, 2, 3, and 8 are very similar, each drawing the content of all four pages, and they only differ in a horizontal shift of coordinates, some clip paths, and minor details.

    Basically each of these pages draws all of the following image but hides different, unwanted parts by shifting them out of the page area or using clip paths:

    screenshot A

    The content streams of pages 4-7 are likewise very similar to each other, basically:

    screenshot B

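    In particular, the text within each set of four pages does not differ. Tika apparently ignores whether or not the text it extracts is visible, so you get the same extracted text for every page in a set. This matches the metadata in the question: pages 1, 2, 3, and 8 each report 1791 characters, and pages 4-7 each report 5672.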
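
The effect can be illustrated with a small, self-contained Java model. This is only a sketch of the idea, not Tika or PDFBox code: the `Fragment` and `Page` classes, the coordinates, and the fragment labels are hypothetical, and clip paths are left out so that only the horizontal shift is modelled. Every page draws the same fragments, and only the shift decides which of them land inside the page area.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VisibilityDemo {
    /** Hypothetical model of a piece of text drawn at a horizontal position. */
    static class Fragment {
        final String text;
        final double x;
        Fragment(String text, double x) { this.text = text; this.x = x; }
    }

    /** Hypothetical page: shared content, a horizontal shift, and a visible range [0, width). */
    static class Page {
        final List<Fragment> content;
        final double shift;
        final double width;
        Page(List<Fragment> content, double shift, double width) {
            this.content = content; this.shift = shift; this.width = width;
        }
    }

    /** Extraction that ignores visibility: returns everything the page draws. */
    static List<String> extractAll(Page page) {
        List<String> out = new ArrayList<>();
        for (Fragment f : page.content) {
            out.add(f.text);
        }
        return out;
    }

    /** Extraction that keeps only fragments whose shifted position lies inside the page area. */
    static List<String> extractVisible(Page page) {
        List<String> out = new ArrayList<>();
        for (Fragment f : page.content) {
            double x = f.x + page.shift;
            if (x >= 0 && x < page.width) {
                out.add(f.text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One wide drawing that holds the text of four pages side by side.
        List<Fragment> sheet = Arrays.asList(
                new Fragment("text A", 10),
                new Fragment("text B", 110),
                new Fragment("text C", 210),
                new Fragment("text D", 310));
        for (int i = 0; i < 4; i++) {
            // Each page shifts the same drawing further to the left.
            Page page = new Page(sheet, -100.0 * i, 100);
            System.out.println("all: " + extractAll(page) + "  visible: " + extractVisible(page));
        }
    }
}
```

In this model `extractAll` returns the same four fragments for every page, which is the behaviour seen in the question, while `extractVisible` returns a different single fragment per page. A real fix would also have to take clip paths into account, which this sketch does not attempt.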


    I used ShowVicinity, a small ad-hoc tool based on PDFBox, to make the whole vicinity of the PDF pages visible.