Is there a way to turn off parsing of embedded docs in the tika-server?


I run an unmodified JAX-RS instance of the Apache tika-server 1.22 and use it as an HTTP endpoint: our application posts files to it (mostly Office, PDF and RTF) and gets plain-text renditions back by sending the "Accept: text/plain" header with each request.

Since Tika 1.15, the default behaviour has been to "extract all embedded documents" (TIKA-2096).

I want to be able to turn this behaviour off on our tika-server so that embedded documents are NOT extracted and I only get the text rendition of the main document contents.

Is it possible to do this via a tika-config.xml file, or do I need to do a custom build and subclass EmbeddedDocumentExtractor so that it doesn't do anything?

An answer to tika-parser-exclude-pdf-attachments indicates that you can turn this behaviour off by subclassing EmbeddedDocumentExtractor, but I'd like to check if it's possible to do this via tika-config.xml without having to do a custom build of the tika-server.

I have looked at the Configuring Tika page, but it makes no mention of embedded docs.


Solution

  • The answers in tika-parser-exclude-pdf-attachments are excellent if you are calling Tika from code.

    Until now there has been no way to do this for embedded files in Tika Server, other than disabling parsing of whole file types by routing them to EmptyParser, with something like the below:

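    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
        <parsers>
            <!-- Default parsers handle everything except the excluded types -->
            <parser class="org.apache.tika.parser.DefaultParser">
                <mime-exclude>image/jpeg</mime-exclude>
                <mime-exclude>application/zip</mime-exclude>
            </parser>
            <!-- The excluded types go to EmptyParser, which extracts nothing -->
            <parser class="org.apache.tika.parser.EmptyParser">
                <mime>image/jpeg</mime>
                <mime>application/zip</mime>
            </parser>
        </parsers>
    </properties>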
    

    This has become a common request, so I've added a feature, coming in Tika 1.25 (yet to be released), that allows skipping embedded files via a request header:

    curl -T test_recursive_embedded.docx http://localhost:9998/tika --header "Accept: text/html" --header "X-Tika-Skip-Embedded: true"
    

    Any parser using the EmbeddedDocumentExtractor will honour this.
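
Since the question describes calling the server from an application rather than from curl, here is a minimal sketch of the same request in Java, asking for plain text instead of HTML. It assumes Java 11+ (for java.net.http), a tika-server recent enough to honour the X-Tika-Skip-Embedded header, and the localhost:9998 endpoint from the curl example. The class name TikaSkipEmbeddedClient and the buildRequest helper are hypothetical names chosen for illustration, not part of Tika.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client class; not part of Tika.
public class TikaSkipEmbeddedClient {

    // Builds a PUT request like the curl call above, but requesting text/plain.
    static HttpRequest buildRequest(URI endpoint, byte[] document) {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain")
                // Header taken from the answer; only newer servers act on it.
                .header("X-Tika-Skip-Embedded", "true")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 1) {
            System.err.println("usage: java TikaSkipEmbeddedClient <file>");
            return;
        }
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        // Assumed endpoint, copied from the curl example.
        URI endpoint = URI.create("http://localhost:9998/tika");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(endpoint, document), HttpResponse.BodyHandlers.ofString());
        // Print the text rendition returned by the server.
        System.out.println(response.body());
    }
}
```

Run it with the path of the document to convert as the only argument. On a server that predates the header, the request should still succeed, but embedded documents will be extracted as before.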