How to correctly use Stanford NLP pipeline for two-phase annotation?
In the first phase I need only tokenization and sentence splitting, so I use this code:
private Annotation annotatedDocument = null;
private StanfordCoreNLP pipeline = null;
...
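public void firstPhase() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit");
    pipeline = new StanfordCoreNLP(props);
    annotatedDocument = new Annotation(textDocument);
    // Run the pipeline; without this call the document is never tokenized or split
    pipeline.annotate(annotatedDocument);
}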
The second phase is optional, so I don't run all annotators in the first phase. The second phase code:
public void secondPhase() {
    // POS tagging
    POSTaggerAnnotator posTaggerAnot = new POSTaggerAnnotator();
    posTaggerAnot.annotate(annotatedDocument);
    // Lemmatization
    MorphaAnnotator morphaAnot = new MorphaAnnotator();
    morphaAnot.annotate(annotatedDocument);
}
First question: Is this approach of using "stand-alone" annotators in the second phase correct? Or is there a way to reuse the existing pipeline?
Second question: I have a problem with the coreference annotator. I would like to use it in the second phase as follows:
CorefAnnotator coref = new CorefAnnotator(new Properties());
But this constructor never seems to return. A constructor without properties doesn't exist, right? Are some property settings required?
There are [at least] 3 ways you can do this:
1. The way you described. It's perfectly valid to call individual annotators and chain them together. The coref annotator should work with empty properties; perhaps you need more memory? It's a bit slow to load, and the models are not small.
2. If you want to keep using a pipeline, you can create a partial pipeline and set the property enforceRequirements=false. This will do the chaining of annotators for you, but doesn't require their requirements to be satisfied. That is, if you know some annotations are already there, you don't have to re-run their corresponding annotators.
3. This is a bigger change, but the simple API actually does this sort of lazy evaluation automatically. So, you can just create a Document object, and when you request various annotations, it'll lazily fault them in.
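For what it's worth, here is a rough sketch of the partial-pipeline and simple API options. It assumes CoreNLP and its English models jar are on the classpath. The class name TwoPhaseAnnotation and the helper method names are made up for illustration, and "pos, lemma" is chosen only to mirror the annotators in the question. Treat it as a starting point rather than a tested recipe for your setup.

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.simple.Document;
import edu.stanford.nlp.simple.Sentence;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical class name, used only for this sketch.
class TwoPhaseAnnotation {

    // Phase 1: tokenization and sentence splitting only.
    static Annotation firstPhase(String text) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document = new Annotation(text);
        pipeline.annotate(document);
        return document;
    }

    // Phase 2 (optional): a partial pipeline that skips the requirement check,
    // relying on the tokens and sentences already produced by phase 1.
    static void secondPhase(Annotation document) {
        Properties props = new Properties();
        props.setProperty("annotators", "pos, lemma");
        props.setProperty("enforceRequirements", "false");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(document);
    }

    // Collects the lemma of every token in the annotated document.
    static List<String> lemmas(Annotation document) {
        List<String> result = new ArrayList<>();
        for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
            result.add(token.lemma());
        }
        return result;
    }

    // Simple API variant: annotations are computed lazily when first requested.
    static List<String> lemmasViaSimpleApi(String text) {
        Document doc = new Document(text);
        List<String> result = new ArrayList<>();
        for (Sentence sentence : doc.sentences()) {
            result.addAll(sentence.lemmas()); // triggers POS tagging and lemmatization on demand
        }
        return result;
    }

    public static void main(String[] args) {
        Annotation doc = firstPhase("The cats were sleeping. Dogs barked.");
        secondPhase(doc);
        System.out.println(lemmas(doc));
        System.out.println(lemmasViaSimpleApi("The cats were sleeping. Dogs barked."));
    }
}
```

Building the second pipeline once and keeping it in a field would avoid reloading the POS model on every call. The same pattern should apply to coref, provided every annotation it depends on is already present on the document.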