Tags: java, parsing, nutch, apache-tika

Changing parsers in tika-config.xml results in "Unable to load org.apache.tika.parser.DefaultParser"


I'm trying to enable Tika's BoilerpipeContentHandler parser within Nutch to extract the article text from web pages. To do this, I've configured tika-config.xml to exclude the HTMLParser and activate the BoilerpipeContentHandler parser as follows:

<properties>
  <service-loader initializableProblemHandler="ignore" loadErrorHandler="WARN" />
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>text/html</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.html.HtmlParser"/>
    </parser>

    <!-- Use a different parser for text/html -->
    <parser class="org.apache.tika.parser.html.BoilerpipeContentHandler">
      <mime>text/html</mime>
    </parser>
  </parsers>
</properties>

When I test this configuration by running

bin/nutch org.apache.nutch.parse.ParserChecker

the output includes:

Dec 12, 2019 5:11:40 PM org.apache.tika.config.LoadErrorHandler$2 handleLoadError
WARNING: Unable to load org.apache.tika.parser.DefaultParser
java.lang.ClassNotFoundException: org.apache.tika.parser.html.HtmlParser
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

and

Dec 12, 2019 5:11:40 PM org.apache.tika.config.LoadErrorHandler$2 handleLoadError
WARNING: Unable to load org.apache.tika.parser.html.BoilerpipeContentHandler
java.lang.ClassNotFoundException: org.apache.tika.parser.html.BoilerpipeContentHandler
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)

I have the classpath set correctly, so I can't figure out why the two parser classes aren't being found. Could Nutch or Tika be using a different classpath? Or is there something obviously wrong with my tika-config.xml?

I'd really appreciate any ideas you have.


Solution

  • I'm going to focus on your end goal: using the Boilerpipe extractor with Nutch. Nutch already supports Boilerpipe through its own configuration, so there is no need to change tika-config.xml.

    Set the tika.extractor property to boilerpipe in your nutch-site.xml. With that setting, Nutch uses Boilerpipe's ArticleExtractor by default.

    You can check https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1645-L1677 for some additional configuration options that are exposed.
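
    As a minimal sketch of the setting described above, the override in nutch-site.xml could look like the following. Only tika.extractor is named in this answer; the second property name, tika.extractor.boilerpipe.algorithm, is an assumption based on nutch-default.xml and should be checked against the file shipped with your Nutch version.

    ```xml
    <!-- nutch-site.xml: site-specific overrides for nutch-default.xml -->
    <configuration>
      <!-- Enable Boilerpipe text extraction (named in the answer above) -->
      <property>
        <name>tika.extractor</name>
        <value>boilerpipe</value>
      </property>

      <!-- Optional. Assumed property name; verify it in your nutch-default.xml.
           ArticleExtractor is described above as the default. -->
      <property>
        <name>tika.extractor.boilerpipe.algorithm</name>
        <value>ArticleExtractor</value>
      </property>
    </configuration>
    ```

    If the output does not change after adding this, it is worth confirming that HTML pages are actually being handled by Nutch's Tika-based parser plugin rather than another HTML parser, since the setting presumably only applies there.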