
WikipediaTokenizer Lucene


I am trying to parse Wikipedia markup with Lucene and found this small project:

http://lucene.apache.org/core/3_0_3/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html

(I was not able to retrieve a more current page, sorry.)

Below is a shortened version of a code example that circulates around this library. When I run it, I get a non-null WikipediaTokenizer, but a NullPointerException as soon as I call incrementToken(). Any ideas?

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

import java.io.StringReader;

public class WikipediaTokenizerTest {
    static Logger logger = Logger.getLogger(WikipediaTokenizerTest.class);
    protected static final String LINK_PHRASES = "click [[link here again]] click     [http://lucene.apache.org here again] [[Category:a b c d]]";

    public WikipediaTokenizer testSimple() throws Exception {
        String text = "This is a [[Category:foo]]";
        return new WikipediaTokenizer(new StringReader(text));
    }

    public static void main(String[] args) {
        WikipediaTokenizerTest wtt = new WikipediaTokenizerTest();

        try {
            WikipediaTokenizer x = wtt.testSimple();

            logger.info(x.hasAttributes());

            while (x.incrementToken()) {
                logger.info("Token found!");
            }
        } catch (Exception e) {
            logger.error("Exception while tokenizing Wiki Text: " + e.getMessage());
        }
    }
}

I use the following dependencies for Maven (pom.xml):

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>4.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>4.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>4.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-wikipedia</artifactId>
    <version>3.0.3</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers</artifactId>
    <version>3.1.0</version>
</dependency>

Any help would be appreciated! If someone knows a better library or solution, please let me know.


Solution

  • You can't mix and match Lucene versions. You are using version 4.2.1, which is not compatible with either version 3.1.0 or 3.0.3. You need to remove those older dependencies.

    WikipediaTokenizer is included in analyzers-common.
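    For illustration, a trimmed dependency list (a sketch, assuming you stay on 4.2.1) could look like this:

    <!-- Keep every Lucene artifact on the same version; analyzers-common
         provides org.apache.lucene.analysis.wikipedia.WikipediaTokenizer -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>4.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>4.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>4.2.1</version>
    </dependency>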


    Also, you are not fulfilling the contract required by TokenStream. See the TokenStream documentation, where the workflow of the TokenStream API is described. In particular, before ever calling incrementToken(), you must call reset(). You should also call end() and close() when you are done.

    WikipediaTokenizer x = wtt.testSimple();
    logger.info(x.hasAttributes());
    x.reset();
    while (x.incrementToken()) {
        logger.info("Token found!");
    }
    x.end();
    x.close();