
Significance of writeLimit in BodyContentHandler of the Apache Tika API?


In our application we need to check whether a file (of any format) is password protected. For that purpose we are using the Apache Tika API. The code looks something like this:

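import java.io.File;
import java.io.InputStream;
import org.apache.commons.io.FileUtils;
import org.apache.tika.exception.EncryptedDocumentException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public static boolean isPasswordProtectedFile(File filePart) {
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    // try-with-resources closes the stream even if parsing fails
    try (InputStream stream = FileUtils.openInputStream(filePart)) {
        // Parse the file; Tika throws EncryptedDocumentException for protected files
        parser.parse(stream, handler, metadata, context);
        LOGGER.debug("File is without password");
    } catch (EncryptedDocumentException e) {
        LOGGER.error("File is encrypted with password", e);
        return true;
    } catch (Exception e) {
        // Any other failure is logged and treated as "not password protected"
        LOGGER.error("File parsing failed", e);
    }
    return false;
}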

However, this consumes too much CPU for a few of the files we tested. If we instead create the BodyContentHandler as shown below, parsing completes faster and uses less CPU:

BodyContentHandler handler = new BodyContentHandler(-1);

I went through the documentation but couldn't fully understand it. What is the likely reason for this difference? Thanks in advance.


Solution

  • According to the Javadoc for BodyContentHandler(int writeLimit):

    https://tika.apache.org/1.4/api/org/apache/tika/sax/BodyContentHandler.html#BodyContentHandler(int)

    Creates a content handler that writes XHTML body character events to an internal string buffer. The contents of the buffer can be retrieved using the ContentHandlerDecorator.toString() method. The internal string buffer is bounded at the given number of characters. If this write limit is reached, then a SAXException is thrown.

    writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit

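    Passing -1 therefore disables the write limit on the internal string buffer. By contrast, the no-argument constructor BodyContentHandler() bounds the buffer at 100,000 characters and throws a SAXException once that limit is reached.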
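
To make the write-limit semantics quoted from the Javadoc more concrete, here is a minimal sketch that uses only the JDK's SAX classes. BoundedTextHandler is a hypothetical, simplified stand-in written for illustration; it is not Tika's actual implementation. It assumes only the documented behavior: -1 disables the limit, and otherwise a SAXException is thrown once the limit is reached.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical, simplified model of a write-limited SAX handler (not Tika's code). */
class BoundedTextHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int writeLimit; // -1 means "no limit"; other values are assumed to be >= 0

    BoundedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit == -1 || buffer.length() + length <= writeLimit) {
            // Everything fits (or the limit is disabled): buffer all characters.
            buffer.append(ch, start, length);
            return;
        }
        // Keep whatever still fits, then signal that the limit was hit.
        int remaining = writeLimit - buffer.length();
        buffer.append(ch, start, remaining);
        throw new SAXException("Write limit of " + writeLimit + " characters reached");
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    public static void main(String[] args) throws SAXException {
        char[] text = "0123456789".toCharArray();

        // With -1 the handler keeps buffering, however much text arrives.
        BoundedTextHandler unlimited = new BoundedTextHandler(-1);
        unlimited.characters(text, 0, text.length);
        unlimited.characters(text, 0, text.length);
        System.out.println(unlimited.toString().length()); // prints 20

        // With a limit of 15, the second call overflows and throws.
        BoundedTextHandler limited = new BoundedTextHandler(15);
        try {
            limited.characters(text, 0, text.length);
            limited.characters(text, 0, text.length);
        } catch (SAXException e) {
            // prints "stopped after 15 characters"
            System.out.println("stopped after " + limited.toString().length() + " characters");
        }
    }
}
```

In the question's method, a SAXException raised this way would land in the generic catch (Exception e) branch, so a file that merely exceeds the write limit would be logged as "File parsing failed" and reported as not password protected.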