
Subtree Extraction NLTK Tree


I need a little help with NLTK trees.

I am trying to extract some subtrees from this French tree:


(SENT (NP-SUJ↓ (PRO=H Personne)) (VN=H (ADV* ne) (V=H sait)) (ADV* exactement) (PONCT* .))

I only want to extract trees having '=H' at the end of the POS label and then add the parent node:

Like this: (NP-SUJ↓ (PRO=H Personne)) and this: (VN=H (V=H sait))

And I wrote a function to do so:

from nltk.tree import ParentedTree

def AddParent(tree):
    grammar = []
    for subtree in tree.subtrees():
        if subtree.height()==2 and subtree.label().endswith("=H"):
            PartialTree = ParentedTree(subtree.parent().label(), 
                               [ParentedTree(subtree.label(), subtree)])
            grammar.append(PartialTree)
    return grammar

#Test
pt = ParentedTree.fromstring("(SENT (NP-SUJ↓ (PRO=H Personne)) (VN=H (ADV* ne) (V=H sait)) (ADV* exactement) (PONCT* .))")
AddParent(pt)
[ParentedTree('NP-SUJ↓', [ParentedTree('PRO=H', ['Personne'])]), 
ParentedTree('VN=H', [ParentedTree('V=H', ['sait'])])]

I have two issues here. First, I want to keep adding information from the original tree to those subtrees. For instance, I want to keep adding ancestor nodes and then children, to get something like this:

(SENT (NP-SUJ↓ ) (VN=H (V=H sait)))


But I lost track of the original tree...

Second, the parent() function returns the parent node together with all the subtrees it contains, and I only want specific nodes.

What would be a good approach to extract this last subtree?

Thank you very much for your help! I am new at this but I really like it!


Solution

  • I can't say I understand your complaint about parent() (perhaps you meant subtrees()?), but there are easier ways to get your hands on subtrees:

    1. Superficial improvement: The subtrees() function accepts a filter argument, so you don't have to check the returned subtrees in your code:

      for subtree in tree.subtrees(filter=lambda t: t.label().endswith("=H")):
          ...  # process each matching subtree here
      
    2. A subtree is a reference to a subpart of the original tree. As long as you don't modify it, it remains part of the original and you can ascend the tree (since you use "parented" trees). Note that any modification you make to the contents of a subtree also modifies the original tree. So instead of embedding the tree you found under a new node, build a wholly new (deep) copy:

      partial = ParentedTree(subtree.parent().label(), [ subtree.copy(deep=True) ])
      

      Then you can freely delete or alter branches in the copy, and you still have the original tree and subtree to work with.

    3. Although you can use the parent() method to climb up the tree, I often find it more convenient to work with "tree positions". A tree position is a tuple of integers, which functions as a path down the tree (use it like an integer index on a list). To find the parent, you just need to slice off the last element of the treeposition:

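      from nltk.tree import Tree

      for postn in tree.treepositions():
          node = tree[postn]
          # treepositions() also yields leaf positions; leaves are plain strings
          if isinstance(node, Tree) and node.label().endswith("=H"):
              parentpos = postn[:-1]   # everything but the last element
              partial = Tree(tree[parentpos].label(), [ node ])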
      

      Note that if you use this method, you don't need the parent() method anymore and hence you might as well use Tree, not ParentedTree.
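To make the tree-position idea concrete, here is a small sketch on the sentence from the question. It assumes a plain Tree; `head_positions` is a hypothetical helper name, not part of NLTK.

```python
from nltk.tree import Tree

tree = Tree.fromstring(
    "(SENT (NP-SUJ↓ (PRO=H Personne)) (VN=H (ADV* ne) (V=H sait)) "
    "(ADV* exactement) (PONCT* .))"
)

def head_positions(t):
    """Positions of all non-leaf nodes whose label ends in '=H'.

    Hypothetical helper for illustration, not an NLTK function.
    """
    return [
        pos
        for pos in t.treepositions()
        # Leaves are plain strings and have no label(), so skip them.
        if isinstance(t[pos], Tree) and t[pos].label().endswith("=H")
    ]

positions = head_positions(tree)
for pos in positions:
    parent_pos = pos[:-1]  # slicing off the last index gives the parent's position
    print(pos, tree[pos].label(), "-> parent:", tree[parent_pos].label())
```

The empty tuple `()` is the position of the root, so a node directly under SENT (such as VN=H here) gets the whole tree as its parent.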

    The above probably doesn't do precisely what you wanted (it's kind of hard to see what you are doing exactly), but I hope you get the picture.
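To get something like the (SENT (NP-SUJ↓ ) (VN=H (V=H sait))) tree from the question, one option is to deep-copy the whole tree and then delete the unwanted branches by position. This is only a sketch, assuming you already know which positions to drop; `prune` and its `drop` argument are hypothetical names, not part of NLTK, and the positions below were picked by hand for this one sentence.

```python
from nltk.tree import Tree

def prune(tree, drop):
    """Return a deep copy of `tree` with the nodes at the `drop` positions removed.

    Hypothetical helper for illustration, not an NLTK function.
    """
    pruned = tree.copy(deep=True)  # the original tree stays untouched
    # Delete from the last position to the first, so that an earlier
    # deletion never shifts a position that is still waiting to be removed.
    for pos in sorted(drop, reverse=True):
        del pruned[pos]
    return pruned

original = Tree.fromstring(
    "(SENT (NP-SUJ↓ (PRO=H Personne)) (VN=H (ADV* ne) (V=H sait)) "
    "(ADV* exactement) (PONCT* .))"
)

# Hand-picked positions: PRO=H, (ADV* ne), (ADV* exactement), (PONCT* .)
pruned = prune(original, [(0, 0), (1, 0), (2,), (3,)])
print(pruned)
```

Because the pruning happens on a copy, `original` keeps all of its branches and you can still look up further ancestors or children in it.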