java nlp stanford-nlp

CRFClassifier doesn't recognize sentence splitter options


I'm using CoreNLP to annotate named entities (NEs) in multiline English text. When I run the pipeline as follows:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
props.put("ssplit.newlineIsSentenceBreak", "always");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String contentStr = "John speaks with Martin\n\nJeremy talks to him too.";
Annotation document = new Annotation(contentStr);
pipeline.annotate(document);
List<CoreMap> sents = document.get(SentencesAnnotation.class);
for (int i = 0; i < sents.size(); i++) {
    System.out.println("sentence " + i + " "+ sents.get(i));
}

Sentence splitting works fine and recognizes two sentences. However, when I use NER classification as follows:

CRFClassifier classifier = CRFClassifier.getClassifier("edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz", props);
String classifiedStr = classifier.classifyWithInlineXML(contentStr);

I get the following error message:

Unknown property: |ssplit.newlineIsSentenceBreak|
Unknown property: |annotators|

and the classifier seems to treat the whole text as one sentence, so it wrongly recognizes a single entity "Martin Jeremy" instead of two distinct entities.

Any idea what's wrong?


Solution

  • CRFClassifier.getClassifier accepts a different set of properties from the StanfordCoreNLP constructor, which is why you get the "Unknown property" errors.

    The unknown properties are still set, but they are not used at run time.

    The classifier reads its options from SeqClassifierFlags. You need to set tokenizerOptions to "tokenizeNLs=true", which makes the tokenizer treat newlines as tokens.

    Bottom line: set the property as follows before getting the classifier. It should no longer report an unknown property, and it should work as intended.

    Properties props = new Properties();
    props.put("tokenizerOptions", "tokenizeNLs=true");
    
    CRFClassifier classifier = CRFClassifier.getClassifier("edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz", props);
    String classifiedStr = classifier.classifyWithInlineXML(contentStr);
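
If you use both the pipeline and the standalone classifier in the same program, it may help to keep the two property sets apart so that pipeline-only keys never reach CRFClassifier.getClassifier. The sketch below uses only java.util.Properties; the class name NerProps and the two helper methods are hypothetical names introduced for illustration, and the keys and values are the ones from the question and the answer above.

```java
import java.util.Properties;

public class NerProps {
    // Options for the StanfordCoreNLP constructor (keys taken from the question).
    static Properties pipelineProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        props.setProperty("ssplit.newlineIsSentenceBreak", "always");
        return props;
    }

    // Options for CRFClassifier.getClassifier (key taken from the answer).
    // Pipeline-only keys such as "annotators" are deliberately left out.
    static Properties classifierProps() {
        Properties props = new Properties();
        props.setProperty("tokenizerOptions", "tokenizeNLs=true");
        return props;
    }

    public static void main(String[] args) {
        // Print the key names of each set to confirm they do not overlap.
        System.out.println("pipeline keys:   " + pipelineProps().stringPropertyNames());
        System.out.println("classifier keys: " + classifierProps().stringPropertyNames());
    }
}
```

You would then pass pipelineProps() to new StanfordCoreNLP(...) and classifierProps() to CRFClassifier.getClassifier(...), as in the snippets above.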