I know I could use DocumentPreprocessor to split a text into sentences. But it does not provide enough information to convert the tokenized text back to the original text, so I have to use PTBTokenizer, which has an invertible option.
However, PTBTokenizer simply returns an iterator over all the tokens (CoreLabels) in a document; it does not split the document into sentences.
The documentation says:
The output of PTBTokenizer can be post-processed to divide a text into sentences.
But this is obviously not trivial: a sentence-final period has to be distinguished from abbreviations, ellipses, and the like.
Is there a class in the Stanford NLP library that can take as input a sequence of CoreLabels and output sentences? Here's what I mean exactly:
List<List<CoreLabel>> split(List<CoreLabel> documentTokens);
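For context, here is the kind of naive post-processing I mean. This is only a sketch using a stand-in Token class (so it compiles without the Stanford jars, and is not the real CoreLabel API): it splits on sentence-final punctuation tokens, which is exactly the approach that breaks on abbreviations like "Dr." and on ellipses.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSentenceSplitter {

    // Minimal stand-in for CoreLabel: just the token text.
    static class Token {
        final String word;
        Token(String word) { this.word = word; }
    }

    // Naive post-processing: close a sentence after ".", "!" or "?".
    // Fails on abbreviations ("Dr."), ellipses, quotes after the period, etc.
    static List<List<Token>> split(List<Token> documentTokens) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (Token t : documentTokens) {
            current.add(t);
            if (t.word.equals(".") || t.word.equals("!") || t.word.equals("?")) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) sentences.add(current); // trailing fragment
        return sentences;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        for (String w : new String[] {
                "I", "am", "a", "sentence", ".",
                "I", "am", "another", "sentence", "."}) {
            tokens.add(new Token(w));
        }
        System.out.println(split(tokens).size()); // 2
    }
}
```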
I would suggest you use the StanfordCoreNLP class. Here is some sample code:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;

public class PipelineExample {

    public static void main(String[] args) throws IOException {
        // build a pipeline that tokenizes, splits sentences, and POS-tags
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // annotate the text
        String text = " I am a sentence. I am another sentence.";
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        System.out.println(annotation.get(TextAnnotation.class));

        // each sentence is a CoreMap holding its own token list
        List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            System.out.println(sentence.get(TokensAnnotation.class));
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // after()/before() hold the whitespace around each token, and
                // beginPosition()/endPosition() are character offsets into the
                // original text, so the original string can be reconstructed
                System.out.println(token.after() != null);
                System.out.println(token.before() != null);
                System.out.println(token.beginPosition());
                System.out.println(token.endPosition());
            }
        }
    }
}
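To make explicit why that whitespace information matters for invertibility: once each token records the whitespace that followed it, the original text can be stitched back together by simple concatenation. Here is a minimal sketch with a stand-in Token class (so it runs without the Stanford jars); the "after" field mimics what CoreLabel.after() returns, and the leading-whitespace parameter mimics the first token's before().

```java
import java.util.List;

public class ReconstructText {

    // Minimal stand-in for a CoreLabel that kept its trailing whitespace.
    static class Token {
        final String word, after;
        Token(String word, String after) { this.word = word; this.after = after; }
    }

    // Concatenate each token with the whitespace recorded after it,
    // prefixed by the whitespace that preceded the first token.
    static String reconstruct(String leading, List<Token> tokens) {
        StringBuilder sb = new StringBuilder(leading);
        for (Token t : tokens) sb.append(t.word).append(t.after);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            new Token("I", " "), new Token("am", " "), new Token("a", " "),
            new Token("sentence", ""), new Token(".", " "),
            new Token("I", " "), new Token("am", " "), new Token("another", " "),
            new Token("sentence", ""), new Token(".", ""));
        // prints " I am a sentence. I am another sentence."
        System.out.println(reconstruct(" ", tokens));
    }
}
```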