ClassNotFoundException: org.apache.xerces.parsers.AbstractSAXParser when using boilerpipe


I am very new to boilerpipe and I am trying out the following basic code:

package contentExtraction;

import java.net.URL;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class ContentExtractor {

    public static void main(String[] args) throws Exception {
        final URL url = new URL(
//              "http://www.l3s.de/web/page11g.do?sp=page11g&link=ln104g&stu1g.LanguageISOCtxParam=en"
            "http://www.dn.se/nyheter/vetenskap/annu-godare-choklad-med-hjalp-av-dna-teknik"
            );

        System.out.println(ArticleExtractor.INSTANCE.getText(url));
    }

}

But I get the following error when I run it:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xerces/parsers/AbstractSAXParser
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.access$100(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument(BoilerpipeSAXInput.java:51)
    at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:69)
    at de.l3s.boilerpipe.extractors.ExtractorBase.getText(ExtractorBase.java:87)
    at contentExtraction.ContentExtractor.main(ContentExtractor.java:16)
Caused by: java.lang.ClassNotFoundException: org.apache.xerces.parsers.AbstractSAXParser
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 16 more

I searched for the error and came across this link, which made me think that xercesImpl.jar was missing from my dependencies. I added it, but my code still fails with the same error. What is the issue?


Solution

  • I figured out the solution myself. The boilerpipe jar has further dependencies of its own. I converted my project to a Maven project and added this dependency:

    <dependency>
        <groupId>com.syncthemall</groupId>
        <artifactId>boilerpipe</artifactId>
        <version>1.2.1</version>
    </dependency>
    

    When I build the project, Maven actually imports four jars, which show up in the Maven Dependencies folder:

    boilerpipe-1.2.1.jar
    nekohtml-1.9.18.jar
    xercesImpl-2.11.0.jar
    xml-apis-1.4.01.jar
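
If you are not sure whether those jars are really visible at run time, a small probe like the one below can help. This is only a sketch and not part of boilerpipe: the class name `ClasspathCheck` and the helper `isOnClasspath` are made up for illustration, and `org.cyberneko.html.HTMLConfiguration` is my assumption for a representative NekoHTML class. The other two class names come from the code and stack trace in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ClasspathCheck {

    /** Returns true if the named class can be loaded (without initializing it). */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError, e.g. a missing superclass
            return false;
        }
    }

    public static void main(String[] args) {
        // jar name (label only) -> one class expected to live in that jar
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("boilerpipe", "de.l3s.boilerpipe.extractors.ArticleExtractor");
        probes.put("xercesImpl", "org.apache.xerces.parsers.AbstractSAXParser");
        probes.put("nekohtml", "org.cyberneko.html.HTMLConfiguration"); // assumed class name

        for (Map.Entry<String, String> probe : probes.entrySet()) {
            String status = isOnClasspath(probe.getValue()) ? "found" : "MISSING";
            System.out.printf("%-12s %s%n", probe.getKey(), status);
        }
    }
}
```

Run it with the same classpath as the failing program. Any line reporting MISSING points at a jar that the run configuration does not see, even if the file exists somewhere in the project.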