java, file-io, download, opennlp

How do I initialize the token model in OpenNLP?


I'm programming a noun phrase extractor in Java, and I'm trying to use the OpenNLP library to tag nouns. Unfortunately, the documentation for OpenNLP is very confusing. At the moment, I'm merely tokenizing a string of English text. The documentation has me initializing the token model using something similar to this:

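import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Declared outside the try block so it is still in scope when the tokenizer is built.
TokenizerModel model = null;

// try-with-resources closes the stream automatically, even if loading fails.
try (InputStream modelIn = new FileInputStream("en-token.bin")) {
    model = new TokenizerModel(modelIn);
}
catch (IOException e) {
    // model stays null if the file is missing or unreadable
    e.printStackTrace();
}

Tokenizer tokenizer = new TokenizerME(model);

String[] tokens = tokenizer.tokenize("An input sample sentence.");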

What I'm confused about here is what "en-token.bin" is, and where exactly I can find it. Was it supposed to be included in the original download of zipped files? Or do I have to download it from OpenNLP's website?

Here's the link to the documentation: https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.tokenizer

Any help you could give me would be greatly appreciated. Thank you in advance!


Solution

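  • You can find the models at http://opennlp.sourceforge.net/models-1.5/. They're not part of the original download at Apache, for licensing reasons. Download en-token.bin (the English tokenizer model) from there, then either put it in your program's working directory or pass its full path to FileInputStream, since a bare file name like "en-token.bin" is resolved relative to the working directory.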
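
Once en-token.bin has been downloaded, a common stumbling block is that the program still can't find it because it is looking in a different directory than expected. Here is a minimal sketch of a helper that checks for the file up front and reports the absolute path it tried. The class and method names (ModelFiles, locate, open) are made up for illustration and are not part of OpenNLP; it only uses the standard library.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of OpenNLP.
public class ModelFiles {

    // Resolves fileName against directory and fails early, naming the
    // absolute path that was checked, if no regular file exists there.
    public static Path locate(String directory, String fileName)
            throws FileNotFoundException {
        Path path = Paths.get(directory, fileName).toAbsolutePath().normalize();
        if (!Files.isRegularFile(path)) {
            throw new FileNotFoundException("Model file not found: " + path);
        }
        return path;
    }

    // Opens the located model file; the caller is responsible for closing it.
    public static InputStream open(String directory, String fileName)
            throws IOException {
        return Files.newInputStream(locate(directory, fileName));
    }
}
```

With OpenNLP on the classpath, the stream returned by ModelFiles.open("models", "en-token.bin") could be handed to the same constructor used in the question, new TokenizerModel(in), inside a try-with-resources block. The "models" directory name here is just an assumption about where you saved the download.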