java, nlp, stanford-nlp

Clause Segmentation using Stanford OpenIE


I'm looking for a good tool for segmenting complex sentences into clauses. Since I use the CoreNLP tools for parsing, I learned that OpenIE performs clause segmentation as part of extracting relation triples from a sentence. Currently, I use the sample code from the OpenIEDemo class in the GitHub repository, but it doesn't segment the sentence into clauses the way I expect. Here is the code:

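import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.OpenIE;
import edu.stanford.nlp.naturalli.SentenceFragment;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.PropertiesUtils;

import java.util.List;
import java.util.Properties;

// Create the Stanford CoreNLP pipeline
Properties props = PropertiesUtils.asProperties(
        "annotators", "tokenize,ssplit,pos,lemma,parse,natlog,openie");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// Build the clause splitter once instead of once per sentence
OpenIE openIE = new OpenIE(props);

// Annotate a sample sentence
String text = "I don't think he will be able to handle this.";
Annotation doc = new Annotation(text);
pipeline.annotate(doc);

// Loop over the sentences in the document and print each clause
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
  List<SentenceFragment> clauses = openIE.clausesInSentence(sentence);
  for (SentenceFragment clause : clauses) {
    System.out.println("Clause: " + clause.toString());
  }
}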

I expect to get three clauses as output:

  • I don't think
  • he will be able
  • to handle this

Instead, the code returns the input unchanged (apart from tokenization):

  • I do n't think he will be able to handle this

However, the sentence

Obama is born in Hawaii and he is no longer our president.

gets two clauses:

  • Obama is born in Hawaii and he is no longer our president
  • he is no longer our president

(It seems that the coordinating conjunction is a good segmentation indicator.)

Is OpenIE generally used for clause segmentation, and if so, how do I do it properly?

Any other practical approaches or tools for clause segmentation are welcome. Thanks in advance.


Solution

  • So, the clause segmenter is a bit more tightly integrated with OpenIE than the name would imply. The goal of the module is to produce logically entailed clauses, which can then be shortened into logically entailed sentence fragments. Going through your two examples:

    1. I don't think he will be able to handle this.

      None of the three clauses is, I think, entailed by the original sentence:

      • "I don't think" -- you likely still "think," even if you don't think something is true.
      • "He will be able" -- If you "think the world is flat," it doesn't mean that the world is flat. Similarly, if you "think he'll be able" it doesn't mean he'll be able.
      • "to handle this" -- I'm not sure this is a clause... I'd group this with "He will be able to handle this," with "able to" being treated as a single verb.
    2. Obama is born in Hawaii and he is no longer our president.

      Naturally, the two clauses should be "Obama is born in Hawaii" and "he is no longer our president." Nonetheless, the clause splitter outputs the original sentence in place of the first clause, in the expectation that the next step of the OpenIE extractor will strip off the "conj:and" edge.
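
To make the "conj:and" point concrete, here is a minimal sketch in plain Java with no CoreNLP dependency. It is not OpenIE's implementation. The dependency edges are written by hand as an assumption about what a parser might produce for the second example, and the class and method names (ClauseSubtreeSketch, subtreeText) are hypothetical. It only shows that reading the subtree under "born" yields the whole sentence until the conjunction edges are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only: a hand-built dependency graph, not real parser output.
final class ClauseSubtreeSketch {

    /** A labeled dependency edge between two 0-based token positions. */
    static final class Edge {
        final int governor;
        final int dependent;
        final String relation;

        Edge(int governor, int dependent, String relation) {
            this.governor = governor;
            this.dependent = dependent;
            this.relation = relation;
        }
    }

    // Token positions: Obama=0 is=1 born=2 in=3 Hawaii=4 and=5
    //                  he=6 is=7 no=8 longer=9 our=10 president=11
    static final String[] TOKENS = {
        "Obama", "is", "born", "in", "Hawaii", "and",
        "he", "is", "no", "longer", "our", "president"
    };

    // Assumed edges (the relation labels are illustrative).
    static final List<Edge> EDGES = List.of(
        new Edge(2, 0, "nsubjpass"),
        new Edge(2, 1, "auxpass"),
        new Edge(2, 4, "nmod:in"),
        new Edge(4, 3, "case"),
        new Edge(2, 5, "cc"),
        new Edge(2, 11, "conj:and"),
        new Edge(11, 6, "nsubj"),
        new Edge(11, 7, "cop"),
        new Edge(11, 9, "advmod"),
        new Edge(9, 8, "neg"),
        new Edge(11, 10, "nmod:poss")
    );

    /** Words reachable from root, in sentence order, ignoring edges with a skipped relation. */
    static String subtreeText(int root, Set<String> skippedRelations) {
        TreeSet<Integer> reached = new TreeSet<>();
        collect(root, skippedRelations, reached);
        List<String> words = new ArrayList<>();
        for (int index : reached) {
            words.add(TOKENS[index]);
        }
        return String.join(" ", words);
    }

    private static void collect(int node, Set<String> skipped, Set<Integer> reached) {
        reached.add(node);
        for (Edge edge : EDGES) {
            if (edge.governor == node && !skipped.contains(edge.relation)) {
                collect(edge.dependent, skipped, reached);
            }
        }
    }

    public static void main(String[] args) {
        // Clause rooted at "born": the whole sentence, as the clause splitter reports it.
        System.out.println(subtreeText(2, Set.of()));
        // Clause rooted at "president": the second conjunct on its own.
        System.out.println(subtreeText(11, Set.of()));
        // Same root as the first clause, with the conjunction edges stripped.
        System.out.println(subtreeText(2, Set.of("conj:and", "cc")));
    }
}
```

In this toy graph, the first two calls reproduce the two clauses reported in the question, and the third gives "Obama is born in Hawaii", which is the fragment you would expect once the conjunction is removed.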