I'm programming a noun phrase extractor in Java, and I'm trying to use the OpenNLP library to tag nouns. Unfortunately, the documentation for OpenNLP is very confusing. At the moment, I'm merely tokenizing a string of English text. The documentation has me initializing the token model using something similar to this:
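import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Declared outside the try block so it is still in scope when the tokenizer is built.
TokenizerModel model = null;
InputStream modelIn = new FileInputStream("en-token.bin");
try {
    model = new TokenizerModel(modelIn);
}
catch (IOException e) {
    e.printStackTrace();
}
finally {
    if (modelIn != null) {
        try {
            modelIn.close();
        }
        catch (IOException e) {
            // Nothing useful to do if closing the stream fails.
        }
    }
}
Tokenizer tokenizer = new TokenizerME(model);
String[] tokens = tokenizer.tokenize("An input sample sentence.");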
What I'm confused about here is what "en-token.bin" is, and where exactly I can find it. Was it supposed to be included in the original download of zipped files? Or do I have to download it from OpenNLP's website?
Here's the link to the documentation: https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.tokenizer
Any help you could give me would be greatly appreciated. Thank you in advance!
You can download the models, including en-token.bin, from http://opennlp.sourceforge.net/models-1.5/. They're not included in the original download from Apache for licensing reasons.
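Once you've downloaded the file, keep in mind that a relative name like "en-token.bin" is resolved against the directory your program is started from, so a misplaced file shows up as a FileNotFoundException. Here is a minimal sketch of a helper that reports the full path it tried. ModelLocator and its open method are hypothetical names I made up for illustration; they are not part of OpenNLP, and the class uses only the standard library.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of OpenNLP): opens a downloaded model file
// and fails with a message that shows the absolute path that was checked.
class ModelLocator {

    static InputStream open(String modelPath) throws IOException {
        // Relative paths are resolved against the current working directory.
        Path path = Paths.get(modelPath).toAbsolutePath();
        if (!Files.isRegularFile(path)) {
            throw new FileNotFoundException(
                "Model file not found at " + path
                + "; download it and put it there, or pass its full path");
        }
        return Files.newInputStream(path);
    }

    // Intended use with the tokenizer model from the question:
    //
    //   try (InputStream in = ModelLocator.open("en-token.bin")) {
    //       TokenizerModel model = new TokenizerModel(in);
    //   }
}
```

The try-with-resources form in the comment closes the stream for you, which replaces the nested try/finally from the documentation snippet.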