stanford-nlp

How to split the result of PTBTokenizer into sentences?


I know I could use DocumentPreprocessor to split a text into sentences, but it does not keep enough information to convert the tokenized text back into the original text. So I have to use PTBTokenizer, which has an invertible option.

However, PTBTokenizer simply returns an iterator of all the tokens (CoreLabels) in a document. It does not split the document into sentences.

The documentation says:

The output of PTBTokenizer can be post-processed to divide a text into sentences.

But doing this post-processing by hand is not trivial.

Is there a class in the Stanford NLP library that can take as input a sequence of CoreLabels, and output sentences? Here's what I mean exactly:

List<List<CoreLabel>> split(List<CoreLabel> documentTokens);
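To make the desired contract concrete, here is a minimal sketch of such a `split` in plain Java. It is not the library's sentence splitter: it assumes that a sentence ends only at a token equal to ".", "?" or "!", which is much cruder than real sentence boundary detection. The class name `TokenSentenceSplitSketch` and the `wordOf` parameter are hypothetical; `wordOf` is there so the same method could be called with `CoreLabel::word` without the sketch depending on the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of Stanford NLP.
final class TokenSentenceSplitSketch {

    // Assumption: only these tokens close a sentence.
    private static final Set<String> SENTENCE_END =
            new HashSet<>(Arrays.asList(".", "?", "!"));

    // Groups the tokens into sentences without copying or altering the token
    // objects, so whatever each token carries (offsets, whitespace) survives.
    static <T> List<List<T>> split(List<T> documentTokens, Function<T, String> wordOf) {
        List<List<T>> sentences = new ArrayList<>();
        List<T> current = new ArrayList<>();
        for (T token : documentTokens) {
            current.add(token);
            if (SENTENCE_END.contains(wordOf.apply(token))) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep trailing tokens that were not closed by sentence-final punctuation.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "I", "am", "a", "sentence", ".", "I", "am", "another", "sentence", ".");
        // Plain strings stand in for tokens here, so the word is the token itself.
        System.out.println(split(tokens, Function.identity()));
    }
}
```

Because the sentences hold the same token objects as the input list, flattening the result gives back the original token sequence.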

Solution

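  • I would suggest you use the StanfordCoreNLP class. Its tokenize and ssplit annotators tokenize the text and group the tokens into sentences, and each token keeps its surrounding whitespace and its character offsets. Here is some sample code: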

    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class PipelineExample {

        public static void main(String[] args) {
            // build pipeline (sentence splitting itself only needs tokenize and ssplit)
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            // annotate a text with leading and doubled whitespace
            String text = " I am a sentence.  I am another sentence.";
            Annotation annotation = new Annotation(text);
            pipeline.annotate(annotation);

            // the original text is kept on the annotation
            System.out.println(annotation.get(TextAnnotation.class));

            // one CoreMap per sentence, each holding its own list of tokens
            List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
            for (CoreMap sentence : sentences) {
                System.out.println(sentence.get(TokensAnnotation.class));
                for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                    // whitespace after and before the token is available
                    System.out.println(token.after() != null);
                    System.out.println(token.before() != null);
                    // character offsets of the token in the original text
                    System.out.println(token.beginPosition());
                    System.out.println(token.endPosition());
                }
            }
        }

    }
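Once every token carries the whitespace around it, getting back to the original text is a matter of concatenation. The following is a minimal sketch of that step in plain Java, under the assumption that each token's "after" string equals the next token's "before" string. `Tok` and `InvertibleTokensSketch` are hypothetical stand-ins so the sketch runs without the library; with real tokens, `CoreLabel`'s `before()`, `originalText()` and `after()` accessors would play the roles of the three fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Stanford NLP.
final class InvertibleTokensSketch {

    // Hypothetical stand-in for a token produced in invertible mode.
    static final class Tok {
        final String before;    // whitespace preceding the token
        final String original;  // token text exactly as it appeared
        final String after;     // whitespace following the token

        Tok(String before, String original, String after) {
            this.before = before;
            this.original = original;
            this.after = after;
        }
    }

    // Rebuilds the text: leading whitespace of the first token, then each
    // token's original text followed by its trailing whitespace.
    static String reconstruct(List<Tok> tokens) {
        if (tokens.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder(tokens.get(0).before);
        for (Tok t : tokens) {
            sb.append(t.original).append(t.after);
        }
        return sb.toString();
    }

    // Flattens sentences back into one token list before rebuilding, so the
    // whitespace between sentences is not counted twice.
    static String reconstructSentences(List<List<Tok>> sentences) {
        List<Tok> all = new ArrayList<>();
        for (List<Tok> sentence : sentences) {
            all.addAll(sentence);
        }
        return reconstruct(all);
    }

    public static void main(String[] args) {
        List<Tok> first = Arrays.asList(
                new Tok(" ", "Hi", " "), new Tok(" ", "there", ""), new Tok("", ".", "  "));
        List<Tok> second = Arrays.asList(
                new Tok("  ", "Bye", ""), new Tok("", ".", ""));
        System.out.println("[" + reconstructSentences(Arrays.asList(first, second)) + "]");
    }
}
```

The sketch only appends "before" for the very first token; appending both "before" and "after" for every token would duplicate the whitespace shared by neighbouring tokens.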