Search code examples
pythonparsingtreenltksubtree

Creating a full nltk parse tree from a list of nltk subtrees in python 3.5


I have list of subtrees which I derived from a parse history formatted as follows:

The parse history:

parse = [('S', 0), ('NP', 1), ('Det', 0), ('N', 0), ('VP', 1), ('V', 4), ('NP', 2), ('NP', 0), ('PN', 1), ('NP', 1), ('Det', 0), ('N', 3)]

Each tuple in the list has a key to a grammar dictionary which contains a list of rules. The second item in the tuple is the index of the rule for that given key.

The grammar is:

grammar = {'S': [['NP', 'VP']],
               'NP': [['PN'], ['Det', 'N']],
               'VP': [['V'], ['V', 'NP', 'NP']],
               'PN': [['John'], ['Mary'], ['Bill']],
               'Det': [['the'], ['a']],
               'N': [['man'], ['woman'], ['drill sergeant'], ['dog']],
               'V': [['slept'], ['cried'], ['assaulted'],
                     ['devoured'], ['showed']]}

The list of subtrees is:

[Tree('S', ['NP', 'VP']), Tree('NP', ['Det', 'N']), Tree('Det', ['the']), Tree('N', ['man']), Tree('VP', ['V', 'NP', NP]), Tree('V', ['showed']), Tree('NP', ['PN']), Tree('PN', ['Mary']), Tree('NP', ['Det', 'N']), Tree('Det', ['the']), Tree('N', ['dog'])]

I created the list of subtrees using the following code:

for item in parse:
        apple = Tree(item[0], grammar[item[0]][item[1]])
        trees.append(apple)

The output I got when I printed the trees (which I know isn't the correct method but it at least shows the subtrees) is as follows:

(S NP VP)
(NP Det N)
(Det the)
(N man)
(VP V NP)
(V showed)
(NP NP NP)
(NP PN)
(PN Mary)
(NP Det N)
(Det the)
(N dog)

Thanks for the help!

::EDIT::

The correct output should look like this:

(S(NP(Det the)(N man))(VP(V showed)(NP(PN Mary))(NP(Det the)(N dog))))

Solution

  • You need to recursively build the tree, but you need to distinguish between terminals and non-terminals. Btw. your parse sequence seems wrong. I hacked this up:

    def build_tree(parse):
      assert(parse)
      rule_head = parse[0][0]
      rule_body = grammar[rule_head][parse[0][1]]
      tree_body = []
      rest = parse[1:]
      for r in rule_body:
        if non_term(r):
            (subtree,rest) = build_tree(rest)
            tree_body.append(subtree)
        else:
            tree_body.append(r)
    
      return (tree(rule_head,tree_body), rest)