I'm using the OpenNLP API in Java for a project I'm working on. The problem is that my program only tags single words, with no relation between adjacent tokens. The code:
String line = input.nextLine();
InputStream inputStreamTokenizer = new FileInputStream("/home/bruno/openNLP/apache-opennlp-1.7.2-src/models/pt-token.bin");
TokenizerModel tokenModel = new TokenizerModel(inputStreamTokenizer);
//Instantiating the TokenizerME class
TokenizerME tokenizer = new TokenizerME(tokenModel);
String tokens[] = tokenizer.tokenize(line);
InputStream inputStream = new FileInputStream("/home/bruno/openNLP/apache-opennlp-1.7.2-src/models/pt-sent.bin");
SentenceModel model = new SentenceModel(inputStream);
//Instantiating the SentenceDetectorME class
SentenceDetectorME detector = new SentenceDetectorME(model);
//Detecting the sentence
String sentences[] = detector.sentDetect(line);
//Loading the NER-location model
//InputStream inputStreamLocFinder = new FileInputStream("/home/bruno/openNLP/apache-opennlp-1.7.2-src/models/en-ner-location.bin");
//TokenNameFinderModel model = new TokenNameFinderModel(inputStreamLocFinder);
//Loading the NER-person model
InputStream inputStreamNameFinder = new FileInputStream("/home/bruno/TryOllie/data/pt-ner-floresta.bin");
TokenNameFinderModel model2 = new TokenNameFinderModel(inputStreamNameFinder);
//Instantiating the NameFinderME class
NameFinderME nameFinder2 = new NameFinderME(model2);
//Finding the names of a location
Span nameSpans2[] = nameFinder2.find(tokens);
//Printing the spans of the locations in the sentence
//for(Span s: nameSpans)
//System.out.println(s.toString()+" "+tokens[s.getStart()]);
Set<String> x = new HashSet<String>();
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
Span[] tokenz = simpleTokenizer.tokenizePos(line);
Set<String> tk = new HashSet<String>();
for (Span tok : tokenz) {
    tk.add(line.substring(tok.getStart(), tok.getEnd()));
}
for (Span n : nameSpans2)
    System.out.println(n.toString() + " -> " + tokens[n.getStart()]);
The output I get is:
Ficheiro com extensao: file.txt
[1..2) event -> choque
[3..4) event -> cadeia
[6..7) artprod -> viaturas
[13..14) event -> feira
[16..18) place -> Avenida
[20..21) place -> Porto
[24..25) event -> incêndio
[2..3) event -> acidente
[5..6) artprod -> viaturas
[44..45) organization -> JN
[46..47) person -> António
[47..48) place -> Campos
[54..60) organization -> Batalhão
[1..2) event -> acidente
[6..8) numeric -> 9
[11..12) place -> Porto-Matosinhos
[21..22) event -> ocorrência
[29..30) artprod -> .
[4..5) organization -> Sapadores
[7..10) organization -> Bombeiros
[14..15) numeric -> 15
What I'm trying to do is multi-term NER: for example, Antonio Campos should be tagged as a single person, not as Person -> Antonio and Place -> Campos, and Universidade Nova de Lisboa should be a single Organization.
You are printing the wrong data structure. The span's getStart and getEnd point to the sequence of tokens that make up the entity (the end index is exclusive). You are printing just the first token.
Also, you are doing tokenization before sentence detection.
Try the following code:
// load the models outside your loop
InputStream inputStream =
new FileInputStream("/home/bruno/openNLP/apache-opennlp-1.7.2-src/models/pt-sent.bin");
SentenceModel model = new SentenceModel(inputStream);
//Instantiating the SentenceDetectorME class
SentenceDetectorME detector = new SentenceDetectorME(model);
InputStream inputStreamTokenizer =
new FileInputStream("/home/bruno/openNLP/apache-opennlp-1.7.2-src/models/pt-token.bin");
TokenizerModel tokenModel = new TokenizerModel(inputStreamTokenizer);
//Instantiating the TokenizerME class
TokenizerME tokenizer = new TokenizerME(tokenModel);
//Loading the NER-person model
InputStream inputStreamNameFinder = new FileInputStream("/home/bruno/TryOllie/data/pt-ner-floresta.bin");
TokenNameFinderModel model2 = new TokenNameFinderModel(inputStreamNameFinder);
//Instantiating the NameFinderME class
NameFinderME nameFinder2 = new NameFinderME(model2);
// assuming input is a java.util.Scanner: nextLine() throws at end of input
// instead of returning null, so test hasNextLine() first
while (input.hasNextLine()) {
    String line = input.nextLine();
    // first we find sentences
    String sentences[] = detector.sentDetect(line);
    for (String sentence : sentences) {
        // now we find the sentence tokens
        String tokens[] = tokenizer.tokenize(sentence);
        // now we are good to apply NER
        Span[] nameSpans = nameFinder2.find(tokens);
        // now we can print the spans
        System.out.println(Arrays.toString(Span.spansToStrings(nameSpans, tokens)));
    }
}
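If you also want the entity type next to the full entity text, you can join the tokens between the span's start and end yourself. Below is a minimal, self-contained sketch of that idea. TokenSpan is a hypothetical stand-in for opennlp.tools.util.Span so the example runs without the OpenNLP jar, and the tokens and spans in main are made up for illustration, not real model output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpanTextDemo {

    // Hypothetical stand-in for opennlp.tools.util.Span:
    // start is inclusive, end is exclusive, both are token indexes.
    static final class TokenSpan {
        final int start;
        final int end;
        final String type;

        TokenSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Joins every token covered by each span, so multi-token entities stay together.
    static List<String> describe(TokenSpan[] spans, String[] tokens) {
        List<String> out = new ArrayList<>();
        for (TokenSpan s : spans) {
            String text = String.join(" ", Arrays.copyOfRange(tokens, s.start, s.end));
            out.add(s.type + " -> " + text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative input only; a real run would take these from the tokenizer and name finder.
        String[] tokens = {"O", "comandante", "Antonio", "Campos", "falou", "ao", "JN", "."};
        TokenSpan[] spans = {
            new TokenSpan(2, 4, "person"),
            new TokenSpan(6, 7, "organization")
        };
        for (String line : describe(spans, tokens)) {
            System.out.println(line);
        }
    }
}
```

With the real Span class, the same join would use s.getStart(), s.getEnd() and s.getType() on each element of nameSpans inside the sentence loop. Whether "Antonio Campos" actually comes back as one person span still depends on the model and on running NER on one sentence at a time.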