Tags: java, nlp, stanford-nlp

Input Penn Treebank constituent trees in a Stanford CoreNLP pipeline


I am using the OpenIE tool from the Stanford NLP libraries to extract minimal clauses from a sentence. Here is what I have come up with so far (largely inspired by their demo code):

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.OpenIE;
import edu.stanford.nlp.naturalli.SentenceFragment;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class OpenIEDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Obama was born in Hawaii. He is our president.");
        pipeline.annotate(doc);

        // Reuse one OpenIE instance rather than constructing a new one per sentence.
        OpenIE openie = new OpenIE(props);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Split the sentence into clauses, then shorten each clause into its entailed fragments.
            List<SentenceFragment> clauses = openie.clausesInSentence(sentence);
            for (SentenceFragment clause : clauses) {
                List<SentenceFragment> shortClauses = openie.entailmentsFromClause(clause);
                for (SentenceFragment shortClause : shortClauses) {
                    System.out.println(shortClause.parseTree);
                }
            }
        }
    }
}

I now want to use PTB constituent trees as input instead of plain text, and then run only the depparse, natlog and openie annotators to get the clauses.

I know that I can use PTB trees as input to the Stanford parser (as explained here), but I have not figured out how to integrate that into the pipeline.
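
For reference, reading a bracketed PTB string into a constituency Tree object is straightforward; something along these lines works (the tree below is just a toy example):

import edu.stanford.nlp.trees.Tree;

// Parse a bracketed Penn Treebank string into a constituency tree.
Tree tree = Tree.valueOf("(ROOT (S (NP (NNP Obama)) (VP (VBD was) (VP (VBN born) (PP (IN in) (NP (NNP Hawaii))))) (. .)))");
tree.pennPrint();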


Solution

  • I think this is actually nontrivial. If someone has a clean way to do this in the pipeline, please chime in! But if I were to do it, I'd probably just call the component bits of code manually (there's a rough sketch after the steps below). This means:

    • Create a SemanticGraph object from the GrammaticalStructure derived from the constituency tree.

    • Add a lemma annotation to each IndexedWord in the semantic graph. This can be done by calling Morphology#lemma(word, posTag) on each token, and setting the LemmaAnnotation to this.

    • Running through the natural logic annotator is going to be tricky. One option is to mock an Annotation object and push it through the usual annotate() method. But, if you don't care too much about the OpenIE system recognizing negation, you can skip this annotator by adding the value Polarity#DEFAULT to each token in the SemanticGraph on the PolarityAnnotation key.

    • Now your dependency tree should be ready to pass through the OpenIE annotator. You want to make three calls here:

      • OpenIE#clausesInSentence(SemanticGraph) will generate a collection of clauses from a given graph.
      • OpenIE#entailmentsFromClause(SentenceFragment) will generate short entailments from each clause. You want to pass each of the outputs from the above function into this, and collect all the resulting fragments.
      • OpenIE#relationsInFragment(SentenceFragment) will segment a short entailment into a relation triple. It returns an Optional -- most fragments don't segment into any triple. You want to pass each of the short entailments collected from the above call into this function, and collect the relation triples that are defined in the output of this function. These are your OpenIE triples.
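
    Putting those steps together, a rough (untested) sketch might look like the following. The class name is just scaffolding, and the exact GrammaticalStructure / SemanticGraphFactory calls (e.g. makeFromTree, EnglishGrammaticalStructure vs. UniversalEnglishGrammaticalStructure) have moved around between CoreNLP releases, so check them against your version:

    import java.util.Optional;
    import java.util.Properties;

    import edu.stanford.nlp.ie.util.RelationTriple;
    import edu.stanford.nlp.ling.IndexedWord;
    import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
    import edu.stanford.nlp.naturalli.OpenIE;
    import edu.stanford.nlp.naturalli.Polarity;
    import edu.stanford.nlp.naturalli.SentenceFragment;
    import edu.stanford.nlp.process.Morphology;
    import edu.stanford.nlp.semgraph.SemanticGraph;
    import edu.stanford.nlp.semgraph.SemanticGraphFactory;
    import edu.stanford.nlp.trees.EnglishGrammaticalStructure;
    import edu.stanford.nlp.trees.GrammaticalStructure;
    import edu.stanford.nlp.trees.Tree;

    public class OpenIEFromTree {
        public static void main(String[] args) {
            // 1. Read the PTB constituency tree and convert it to a dependency graph.
            Tree tree = Tree.valueOf(
                "(ROOT (S (NP (NNP Obama)) (VP (VBD was) (VP (VBN born) (PP (IN in) (NP (NNP Hawaii))))) (. .)))");
            GrammaticalStructure gs = new EnglishGrammaticalStructure(tree);
            SemanticGraph graph = SemanticGraphFactory.makeFromTree(gs);

            // 2. Lemmatize each token, and give every token a default polarity so the
            //    natlog annotator can be skipped (negation will not be recognized).
            Morphology morphology = new Morphology();
            for (IndexedWord token : graph.vertexListSorted()) {
                token.setLemma(morphology.lemma(token.word(), token.tag()));
                token.set(NaturalLogicAnnotations.PolarityAnnotation.class, Polarity.DEFAULT);
            }

            // 3. Run the OpenIE component directly on the dependency graph.
            OpenIE openie = new OpenIE(new Properties());
            for (SentenceFragment clause : openie.clausesInSentence(graph)) {
                for (SentenceFragment entailment : openie.entailmentsFromClause(clause)) {
                    // Most fragments do not segment into a triple, hence the Optional.
                    Optional<RelationTriple> triple = openie.relationsInFragment(entailment);
                    triple.ifPresent(System.out::println);
                }
            }
        }
    }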

    Out of curiosity, what are you trying to do in the end? Perhaps there's an easier way to accomplish the same goal.