"zip bomb" exception while sending HTML document to Solr


I'm sending an HTML document to Solr, and Tika throws a "Zip bomb detected!" exception. The Solr log reports: "Suspected zip bomb: 100 levels of XML element nesting".

Looking at the Tika source, there is an arbitrary limit of 100 levels of XML element nesting (see here).

The variable in question (maxDepth) has a public setter, but I am not sure whether it can be set from Solr. Is that possible?
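To make the limit concrete, here is a minimal sketch of a nesting-depth check written against the JDK's own SAX API. It is not Tika's SecureContentHandler. The class name DepthLimitSketch, the helper names, and the exact comparison (failing as soon as the depth reaches the limit) are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: a stand-in for the idea behind the check, not Tika's code.
class DepthLimitSketch {

    /** Tracks how many elements are currently open and aborts at the limit. */
    static class DepthLimitHandler extends DefaultHandler {
        private final int maxDepth;
        private int depth = 0;

        DepthLimitHandler(int maxDepth) {
            this.maxDepth = maxDepth;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            depth++;
            // Assumed semantics: fail once the depth reaches the limit.
            if (depth >= maxDepth) {
                throw new SAXException("Suspected zip bomb: " + depth
                        + " levels of XML element nesting");
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
        }
    }

    /** Returns true if parsing finishes; false on any SAXException (limit hit or malformed XML). */
    static boolean parsesWithin(String xml, int maxDepth) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)),
                           new DepthLimitHandler(maxDepth));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    /** Builds a document of n nested <div> elements. */
    static String nested(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append("<div>");
        for (int i = 0; i < n; i++) sb.append("</div>");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(parsesWithin(nested(99), 100));  // true
        System.out.println(parsesWithin(nested(100), 100)); // false
    }
}
```

With a check of this kind, a deeply nested but otherwise harmless HTML page fails in the same way as a malicious one, which matches the behaviour seen in the log.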

Here is the full stack trace:

2018-04-05 16:47:48.034 ERROR (qtp1654589030-15) [   x:aconn] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Zip bomb detected!
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at ca.calgary.csc.wds.solr.GsaAconnRequestHandler.handleRequestBody(GsaAconnRequestHandler.java:84)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:534)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.tika.exception.TikaException: Zip bomb detected!
    at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:138)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    ... 35 more
Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
    at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:234)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:255)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:297)
    at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:251)
    at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:167)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:60)
    at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
    at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
    at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
    at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:625)
    at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
    at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:135)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    ... 36 more

Edit: I found a Jira issue that appears to have a similar cause. The solution, given by Tim Allison, is to use Tika's default HTML mapper instead of the one Solr supplies. How can I set this up in the Solr config?

Edit 2: I have verified that this is not a Tika issue, as the tika-app jar successfully extracts the file contents:

>java -jar tika-app-1.16.jar -t test.html

Solution

  • According to Tim, this cannot be set up via the Solr config. The alternative I have seen recommended elsewhere is to run Tika outside of Solr, i.e. not to use Solr Cell.
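One way to run Tika outside of Solr is sketched below, using only the JDK: the tika-app jar is invoked as a child process with the same arguments as the command in the question, and the extracted text is then posted to Solr's JSON update handler. Everything specific here is an assumption: the class name ExternalTikaSketch, the field names id and content, the UTF-8 decoding of tika-app's output, and the example Solr URL in the usage message.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: extract with tika-app, then index the text via Solr's /update handler.
class ExternalTikaSketch {

    /** Runs "java -jar <tikaJar> -t <file>" and returns what it prints to stdout. */
    static String extractText(String tikaJar, String file)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("java", "-jar", tikaJar, "-t", file)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = p.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
        }
        if (p.waitFor() != 0) throw new IOException("tika-app exited with an error");
        // Assumption: the output is UTF-8; adjust if the platform encoding differs.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Encodes a Java string as a JSON string literal. */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Builds a one-document update body; "id" and "content" are assumed field names. */
    static String toUpdateJson(String id, String content) {
        return "[{\"id\":" + jsonString(id) + ",\"content\":" + jsonString(content) + "}]";
    }

    /** POSTs the JSON body to the given update URL and returns the HTTP status code. */
    static int postToSolr(String updateUrl, String json) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(updateUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: ExternalTikaSketch <tika-app.jar> <file> <solr-update-url>");
            System.out.println("e.g.   tika-app-1.16.jar test.html "
                    + "http://localhost:8983/solr/aconn/update?commit=true");
            return;
        }
        String text = extractText(args[0], args[1]);
        int status = postToSolr(args[2], toUpdateJson(args[1], text));
        System.out.println("Solr responded with HTTP " + status);
    }
}
```

Because extraction happens in a separate process, a document that trips Tika's safety checks or exhausts memory fails there rather than inside the Solr JVM. A long-running indexer would more likely embed Tika as a library or call a Tika server than spawn a JVM per file.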