pdf, apache-tika

Tika parser is not parsing the whole file


I have a PDF file that is 122 pages long. When I parse it with Tika (version 1.17), the returned string doesn't contain the whole text.

I use the following simple code to get the text:

    String content = new Tika().parseToString(file);

The text I get from this code ends at around page 118; the last pages are missing.


Solution

  • Promoting a comment to an answer...

    By default, Apache Tika caps the amount of text it will let a parser generate (100,000 characters), to avoid accidentally swamping the caller with output. In your case, it looks like you're hitting that limit when you really do want more!

    If you're using the Tika facade helper class, you just need to call setMaxStringLength(int) on your Tika instance with a higher limit, or with -1 to disable the limit entirely.

    If you're using the Tika parser classes directly, you should instead set a higher write limit (or -1) on your content handler, e.g. BodyContentHandler(int writeLimit).
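
Both options from the answer can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name FullTextExtractor and its two helper methods are hypothetical names chosen for illustration, and it assumes the Tika 1.x core and parsers jars are on the classpath. Note that -1 removes the safety cap entirely, so very large documents will be held in memory in full.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name and method names are illustrative only.
public class FullTextExtractor {

    // Facade approach: lift the string length cap before calling parseToString.
    static String extractWithFacade(File file) throws IOException, TikaException {
        Tika tika = new Tika();
        tika.setMaxStringLength(-1); // -1 disables the limit
        return tika.parseToString(file);
    }

    // Parser approach: pass the write limit to the content handler instead.
    static String extractWithParser(File file)
            throws IOException, SAXException, TikaException {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the limit
        try (InputStream stream = Files.newInputStream(file.toPath())) {
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        File file = new File(args[0]); // path to the document, e.g. the 122-page PDF
        System.out.println("Facade: " + extractWithFacade(file).length() + " characters");
        System.out.println("Parser: " + extractWithParser(file).length() + " characters");
    }
}
```

If removing the cap altogether feels too risky, passing a large positive value (for example 10 * 1024 * 1024) to either call keeps a safety net while still leaving room for long documents.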