
How to extract elements from tree.productions()


(1) My goal: to extract the left-hand side and the right-hand side of a production.

(2) My approach: I am using the Stanford parser and NLTK to get the parse tree of a sentence. My code is below:

import json

import corenlp  # the corenlp-python wrapper
from nltk.tree import Tree

corenlp_dir = "/home/corenlp-python/stanford-corenlp-full-2013-11-12/"
parser = corenlp.StanfordCoreNLP(corenlp_path=corenlp_dir)

result_json = json.loads(parser.parse("I have a tree."))
for sentence in result_json["sentences"]:
    t = Tree.fromstring(sentence["parsetree"])
    print(t.productions())   # [ROOT -> S, S -> NP VP ., NP -> PRP, PRP -> 'I', VP -> VBP NP, VBP -> 'have', NP -> DT NN, DT -> 'a', NN -> 'tree', . -> '.']

    print(t.productions()[1])  # S -> NP VP .
    print(type(t.productions()[1]))  # <class 'nltk.grammar.Production'>

    # Iterating over a Tree yields only its direct children; ROOT has a single child, S.
    for (i, child) in enumerate(t):
        print((i, child))  # (0, Tree('S', [Tree('NP', [Tree('PRP', ['I'])]), Tree('VP', [Tree('VBP', ['have']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['tree'])])]), Tree('.', ['.'])])) I can only get one tree.

(3) My question: how can I extract the elements on both sides of each production, such as 'S' and 'NP VP .'? Is there a method that does this?

Could anyone help or point me in the right direction?


Solution

  • nltk.Tree is actually a subclass of the Python list, so you can access the children of any node c by c[0], c[1], c[2], etc. Note that NLTK trees are not explicitly binary by design, so your notion of "left" and "right" might have to be enforced somewhere in a contract.

    Assuming the tree is binary, you can access the left child of a node with c[0], and the right with c[1]. For your second task:

    "But what I want to do is to extract the left-hand side of a production and gather right-hand side of all productions with the same left-hand side."

    If I understand correctly, you can traverse the tree and build up a dict as you go, where the keys are left-hand sides and the values are lists of the right-hand sides seen for each. Plain nltk.Tree objects are mutable lists, so they are not hashable and cannot be used as dict keys directly, but you can use the string form of the Tree objects as keys in any case.
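
As a minimal sketch of the ideas above (not part of the original answer): each item returned by tree.productions() is an nltk.grammar.Production, which exposes lhs() and rhs(), so the dict can be built from the productions without a manual traversal. The parse string is copied from the question's output so the example runs without the Stanford parser; the names child_labels and rhs_by_lhs are illustrative only, and the example assumes NLTK 3 on Python 3.

```python
from collections import defaultdict

from nltk.grammar import Nonterminal
from nltk.tree import Tree

# Bracketed parse taken from the question's output; no parser call is needed.
tree = Tree.fromstring(
    "(ROOT (S (NP (PRP I)) (VP (VBP have) (NP (DT a) (NN tree))) (. .)))"
)

# A Tree is a list subclass: tree[0] is the S node, and iterating over it
# yields its direct children.
s_node = tree[0]
child_labels = [child.label() for child in s_node]

# Map each left-hand side symbol to the distinct right-hand sides seen for it.
rhs_by_lhs = defaultdict(list)
for production in tree.productions():
    lhs = production.lhs().symbol()
    # rhs() mixes Nonterminal objects (for subtrees) and plain strings (for words).
    rhs = tuple(
        item.symbol() if isinstance(item, Nonterminal) else item
        for item in production.rhs()
    )
    if rhs not in rhs_by_lhs[lhs]:
        rhs_by_lhs[lhs].append(rhs)

for lhs, alternatives in rhs_by_lhs.items():
    print(lhs, "->", " | ".join(" ".join(rhs) for rhs in alternatives))
```

Using symbol strings and tuples as keys and values sidesteps the hashability question for Tree objects, and it keeps terminals such as 'I' and nonterminals such as NP in one uniform representation.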