Tags: java, stanford-nlp

Stanford NLP pipeline – sequential processing (in Java)


How to correctly use Stanford NLP pipeline for two-phase annotation?


In the first phase I need only tokenization and sentence splitting, so I use this code:

private Annotation annotatedDocument = null;
private StanfordCoreNLP pipeline = null;

...

public void firstPhase() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit");

    pipeline = new StanfordCoreNLP(props);
    annotatedDocument = new Annotation(textDocument);
    pipeline.annotate(annotatedDocument);  // run tokenize + ssplit on the document
}

The second phase is optional, so I don't use all annotators in the first phase. The second-phase code:

public void secondPhase() {
    // POS tagging
    POSTaggerAnnotator posTaggerAnot = new POSTaggerAnnotator();
    posTaggerAnot.annotate(annotatedDocument);

    // Lemmatization
    MorphaAnnotator morphaAnot = new MorphaAnnotator();
    morphaAnot.annotate(annotatedDocument);
}

First question: Is this approach of using "stand-alone" annotators in the second phase correct? Or is there a way to reuse the existing pipeline?

Second question: I have a problem with the coreference annotator. I would like to use it in the second phase as follows:

CorefAnnotator coref = new CorefAnnotator(new Properties());

But this constructor seems to never finish. A constructor without properties doesn't exist, right? Is some property setting necessary?


Solution

  • There are [at least] 3 ways you can do this:

    1. The way you described. It's perfectly valid to just call individual annotators and chain them together. The coref annotator should work with empty properties -- perhaps you need more memory? It's a bit slow to load, and the models are not small (see the first sketch after this list).

    2. If you want to keep using a pipeline, you can create a partial pipeline and set the property enforceRequirements=false. This will do the chaining of annotators for you, but won't insist that their requirements be satisfied -- i.e., if you know some annotations are already there, you don't have to re-run the annotators that produce them (see the second sketch after this list).

    3. This is a bigger change, but the Simple API actually does this sort of lazy evaluation automatically. So you can just create a Document object, and when you request various annotations, it'll lazily fault them in (see the third sketch after this list).
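
A minimal sketch of option 1, continuing the two-phase code above. The heap size and the precondition comment are my assumptions, not something the answer spells out:

// Option 1 (sketch): stand-alone annotators chained by hand.
// Coref models are large and slow to load; give the JVM a big heap,
// e.g. run with -Xmx5g (the exact value is a guess -- tune it).
Properties corefProps = new Properties();
CorefAnnotator coref = new CorefAnnotator(corefProps);  // the "hang" is usually model loading

// Coref assumes the earlier layers (tokens, sentences, POS, lemmas,
// NER, parse) are already present on the Annotation before this call.
coref.annotate(annotatedDocument);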
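
A sketch of option 2, assuming annotatedDocument already carries the tokenize/ssplit output from the first phase:

// Option 2 (sketch): a second, partial pipeline for the optional phase.
Properties props = new Properties();
props.setProperty("annotators", "pos, lemma");
// Don't insist that tokenize/ssplit appear in this pipeline -- their
// annotations are already on the document from the first phase.
props.setProperty("enforceRequirements", "false");

StanfordCoreNLP secondPipeline = new StanfordCoreNLP(props);
secondPipeline.annotate(annotatedDocument);  // adds POS tags and lemmas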
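
And a sketch of option 3 with the Simple API; the sample sentence is made up:

import edu.stanford.nlp.simple.Document;
import edu.stanford.nlp.simple.Sentence;

Document doc = new Document("The quick brown fox jumped over the lazy dog.");
for (Sentence sent : doc.sentences()) {   // tokenization/ssplit run lazily here
    System.out.println(sent.posTags());   // POS tagger is faulted in on demand
    System.out.println(sent.lemmas());    // lemmatizer likewise
}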