I'm trying to parse Wikipedia markup with Lucene and found this little project:
(was not able to retrieve a proper website, sorry)
Below is a shortened version of a code example that circulates around this library. When I run it, I get a non-null WikipediaTokenizer, but a NullPointerException as soon as I call incrementToken(). Any ideas?
import org.apache.log4j.Logger;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

import java.io.StringReader;

public class WikipediaTokenizerTest {
    static Logger logger = Logger.getLogger(WikipediaTokenizerTest.class);

    protected static final String LINK_PHRASES = "click [[link here again]] click [http://lucene.apache.org here again] [[Category:a b c d]]";

    public WikipediaTokenizer testSimple() throws Exception {
        String text = "This is a [[Category:foo]]";
        return new WikipediaTokenizer(new StringReader(text));
    }

    public static void main(String[] args) {
        WikipediaTokenizerTest wtt = new WikipediaTokenizerTest();
        try {
            WikipediaTokenizer x = wtt.testSimple();
            logger.info(x.hasAttributes());
            while (x.incrementToken() == true) {
                logger.info("Token found!");
            }
        } catch (Exception e) {
            logger.error("Exception while tokenizing Wiki Text: " + e.getMessage());
        }
    }
}
I use the following Maven dependencies (pom.xml):
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>4.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>4.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>4.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-wikipedia</artifactId>
    <version>3.0.3</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers</artifactId>
    <version>3.1.0</version>
</dependency>
Any help would be appreciated! If someone has a better library or solution, please let me know.
You can't mix and match your Lucene versions. You are using version 4.2.1, which is not compatible with either version 3.1.0 or 3.0.3. You need to remove those two dependencies; WikipediaTokenizer is included in lucene-analyzers-common.
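For reference, a cleaned-up dependency section might look like this (a minimal sketch: just the artifacts from your pom that you actually need, all pinned to 4.2.1):

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>4.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>4.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>4.2.1</version>
</dependency>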
Also, you are not fulfilling the contract required by TokenStream. See the TokenStream documentation, where the workflow of the TokenStream API is described. In particular, before ever calling incrementToken(), you must call reset(). You should really also end() and close() the stream:
WikipediaTokenizer x = wtt.testSimple();
logger.info(x.hasAttributes());
x.reset();
while (x.incrementToken()) {
    logger.info("Token found!");
}
x.end();
x.close();
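As a side note, incrementToken() only advances the stream; to see the actual token text you read it through the attribute API. A minimal sketch of how that could look (assuming the same 4.2.1 setup; CharTermAttribute and TypeAttribute are standard attributes from org.apache.lucene.analysis.tokenattributes):

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Obtain the attributes before reset(), so they are registered
// with the stream's attribute source for the whole iteration.
CharTermAttribute term = x.addAttribute(CharTermAttribute.class);
TypeAttribute type = x.addAttribute(TypeAttribute.class);

x.reset();
while (x.incrementToken()) {
    // term.toString() is the token text; type.type() distinguishes
    // plain words from Wikipedia constructs such as categories.
    logger.info(term.toString() + " [" + type.type() + "]");
}
x.end();
x.close();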