Typically, when you use the CoreNLP annotation pipeline for, say, NER, you would write the following code:
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
I would like to perform sentence splitting (ssplit) in the above pipeline, but then remove sentences that are too long before continuing with the rest of the pipeline. What I have been doing is performing sentence splitting, filtering the sentences by length, then performing NER by applying the entire pipeline (tokenize, ssplit, pos, lemma, ner). So essentially I have performed tokenize and ssplit twice. Is there a more efficient way of doing this? For example, performing tokenize and ssplit, then pausing the pipeline to remove sentences that are too long, then resuming the pipeline with pos, lemma, and ner.
You can create two pipeline objects, with the second one running only the later annotators. So:
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
Followed by:
Properties props = new Properties();
props.put("annotators", "pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false); // false = don't enforce requirements, so tokenize/ssplit need not be listed
pipeline.annotate(document);
Note, of course, that some of the annotations (e.g., character offsets) will be unintuitive if you delete intermediate sentences.
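Putting the two pieces together with a filtering step in between, a minimal sketch might look like the following (the 30-token threshold and the sample text are arbitrary examples, not part of the original question):

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

public class FilteredPipeline {
    public static void main(String[] args) {
        // First pipeline: tokenization and sentence splitting only.
        Properties tokProps = new Properties();
        tokProps.put("annotators", "tokenize, ssplit");
        StanfordCoreNLP tokenizer = new StanfordCoreNLP(tokProps);

        Annotation document = new Annotation("Some long text to process.");
        tokenizer.annotate(document);

        // Drop sentences longer than an (arbitrary) 30-token threshold.
        List<CoreMap> kept = document.get(CoreAnnotations.SentencesAnnotation.class)
                .stream()
                .filter(s -> s.get(CoreAnnotations.TokensAnnotation.class).size() <= 30)
                .collect(Collectors.toList());
        document.set(CoreAnnotations.SentencesAnnotation.class, kept);

        // Second pipeline: later annotators only. The 'false' argument
        // disables requirement enforcement, since tokenize/ssplit have
        // already been run on the document.
        Properties nerProps = new Properties();
        nerProps.put("annotators", "pos, lemma, ner");
        StanfordCoreNLP ner = new StanfordCoreNLP(nerProps, false);
        ner.annotate(document);
    }
}
```

Note that the document-level token list and character offsets still reflect the original text, so downstream code that relies on them should work from the filtered sentence list rather than the document-level annotations.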