
Stanford CoreNLP: pause and resume an annotation pipeline


Typically, when you use the CoreNLP annotation pipeline for, say, NER, you would write the following code:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// text is assumed to be the raw input String
Annotation document = new Annotation(text);
pipeline.annotate(document);

I would like to run sentence splitting (ssplit) from the above pipeline, and then remove sentences that are too long before continuing with the rest of the pipeline. What I have been doing is splitting the text into sentences, filtering the sentences by length, and then performing NER by applying the entire pipeline (tokenize, ssplit, pos, lemma, ner) to what remains. So, in effect, tokenize and ssplit run twice. Is there a more efficient way of doing this? For example, can I run tokenize and ssplit, pause the pipeline to remove the sentences that are too long, and then resume the pipeline with pos, lemma, and ner?


Solution

  • You can create two pipeline objects, with the second one running only the later annotators. Start with a pipeline that stops after sentence splitting:

    // First stage: tokenization and sentence splitting only
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);
    

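    Then, once you have filtered the sentences, run a second pipeline on the same document. Passing false as the second constructor argument (enforceRequirements) turns off the check that would otherwise reject a pipeline lacking tokenize and ssplit:

    // Second stage: distinct variable names so both snippets can share a scope
    Properties nerProps = new Properties();
    nerProps.setProperty("annotators", "pos, lemma, ner");
    StanfordCoreNLP nerPipeline = new StanfordCoreNLP(nerProps, false);
    nerPipeline.annotate(document);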
    

    Note, of course, that some of the annotations (e.g., character offsets) still refer to the original text, so they will be unintuitive if you delete intermediate sentences.
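
The filtering step between the two pipelines is not shown above, so here is a minimal sketch of one way it might look. `SentenceFilter` and `keepShort` are hypothetical names, not part of CoreNLP. The helper is written generically so that it runs without the library on the classpath. The CoreNLP wiring in the trailing comment is an untested assumption: it supposes that sentences are read from and written back to `CoreAnnotations.SentencesAnnotation`, and that each sentence's tokens are available under `CoreAnnotations.TokensAnnotation`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical helper (not part of CoreNLP): keeps only the sentences whose
// token count is at most maxTokens. Generic so it runs without CoreNLP.
class SentenceFilter {

    static <S> List<S> keepShort(List<S> sentences,
                                 ToIntFunction<S> tokenCount,
                                 int maxTokens) {
        List<S> kept = new ArrayList<>();
        for (S sentence : sentences) {
            if (tokenCount.applyAsInt(sentence) <= maxTokens) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for the output of the "tokenize, ssplit" pipeline:
        // each sentence is represented as a list of tokens.
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("Short", "one", "."),
            Arrays.asList("This", "sentence", "is", "a", "bit", "too", "long", "."));

        List<List<String>> kept = keepShort(sentences, List::size, 5);
        System.out.println(kept); // prints [[Short, one, .]]

        // Assumed CoreNLP wiring, placed between the two pipelines
        // (the limit of 50 tokens is an arbitrary illustration):
        //   List<CoreMap> all = document.get(CoreAnnotations.SentencesAnnotation.class);
        //   List<CoreMap> shortOnes = SentenceFilter.keepShort(all,
        //       s -> s.get(CoreAnnotations.TokensAnnotation.class).size(), 50);
        //   document.set(CoreAnnotations.SentencesAnnotation.class, shortOnes);
    }
}
```

The sketch builds a new list instead of removing sentences in place, since the list held by the document may not support removal. The document-level token list is left untouched here, so it would still contain the tokens of the dropped sentences.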