Apache Tika for parsing only Office docs - Build exclusions

I would like to parse files to text/xml.

I only really need to parse Microsoft Office documents (specifically, Microsoft Word).

I currently include the entire tika-parsers dependency in my application.

Since this is heavy and includes a lot of things I don't need, is there a list of modules I can safely exclude if I'm only interested in parsing Office documents?

Solution

There is a version of Tika that splits the parser libraries into separate modules, one per family of file types.

While it seems that this version is no longer being updated, it can be used as a guide to which modules are necessary for which file type you're parsing.

For example, the pom.xml of tika-parser-advanced-module declares a dependency on opennlp-tools, while the pom.xml of tika-parser-office-module does not. So if you're only interested in parsing Office documents, you can exclude opennlp-tools from tika-parsers.
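As a rough sketch of what that exclusion could look like in an ivy.xml, assuming the dependency is declared directly on tika-parsers: the `${tika.version}` property is a placeholder I made up for whatever revision you already use, and org.apache.opennlp is the organisation under which opennlp-tools is published. This is an illustration of the idea, not a verified minimal set of exclusions.

```xml
<!-- ivy.xml fragment: keep tika-parsers but drop one transitive dependency
     that the Office parsers do not appear to need. -->
<!-- ${tika.version} is a hypothetical property; substitute your own revision. -->
<dependency org="org.apache.tika" name="tika-parsers" rev="${tika.version}">
    <!-- opennlp-tools is declared by the advanced module, not the office module -->
    <exclude org="org.apache.opennlp" module="opennlp-tools"/>
</dependency>
```

With Maven, the same thing would be expressed as an `<exclusions>` element inside the `<dependency>`. Either way, it is probably worth adding exclusions one at a time and re-running a Word parsing test after each, since a missing class may only show up at runtime.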

In addition, the ivy:report output (or, for Maven, the dependency tree from mvn dependency:tree) helps, since it shows which transitive dependencies tika-parsers actually pulls in.

If anyone has any input on this, I'm still open to hearing suggestions/comments.