python parsing nlp parse-tree

How to read a constituency-based parse tree


I have a corpus of sentences that were preprocessed with Stanford CoreNLP. One of the things it provides is each sentence's constituency-based parse tree. While I can understand a parse tree when it's drawn as a tree, I'm not sure how to read it in this bracketed format:

E.g.:

          (ROOT
            (FRAG
              (NP (NN sent28))
              (: :)
              (S
                (NP (NNP Rome))
                (VP (VBZ is)
                  (PP (IN in)
                    (NP
                      (NP (NNP Lazio) (NN province))
                      (CC and)
                      (NP
                        (NP (NNP Naples))
                        (PP (IN in)
                          (NP (NNP Campania))))))))
              (. .)))

The original sentence is:

sent28: Rome is in Lazio province and Naples in Campania .

How am I supposed to read this tree? Alternatively, is there code (in Python) that reads it properly? Thanks.


Solution

  • NLTK has a class for reading parse trees: nltk.tree.Tree. Its fromstring class method parses the bracketed string into a Tree object, and you can then iterate over its subtrees, leaves, and so on.

    As an aside: you might want to remove the "sent28:" prefix before parsing, since it confuses the parser and is not part of the sentence. Because of it, the top of your tree is a sentence fragment (the FRAG node) rather than a full parse of the sentence.
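
To make the suggestion above concrete, here is a minimal sketch that loads the tree from the question with nltk.tree.Tree.fromstring and pulls out a few things. It assumes NLTK 3.x is installed; the variable names (parse, words, tagged, np_phrases) are only illustrative, and the tree is used as posted, including the "sent28 :" prefix.

```python
from nltk.tree import Tree

# The bracketed parse exactly as posted in the question.
# Each "(" opens a constituent: the first token is its label,
# everything after it is either a child constituent or a word.
parse = """
(ROOT
  (FRAG
    (NP (NN sent28))
    (: :)
    (S
      (NP (NNP Rome))
      (VP (VBZ is)
        (PP (IN in)
          (NP
            (NP (NNP Lazio) (NN province))
            (CC and)
            (NP
              (NP (NNP Naples))
              (PP (IN in)
                (NP (NNP Campania))))))))
    (. .)))
"""

tree = Tree.fromstring(parse)

# The words of the sentence, left to right.
words = tree.leaves()

# (word, part-of-speech tag) pairs taken from the pre-terminal nodes.
tagged = tree.pos()

# The text covered by every NP node, visited top-down.
np_phrases = [
    " ".join(subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]

print(tree.label())   # label of the top node
print(words)
print(tagged)
print(np_phrases)
```

If you want to see the drawn form you are used to, tree.pretty_print() renders the same structure as ASCII art in the terminal.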