
How to get rid of -NONE- and *T*-i in PTB parse trees using NLTK?


I process Penn Treebank v2 trees and often encounter "service" subtrees like these (along with several other types):

[Image: a parse-tree fragment in which an SBAR node dominates only -NONE- nodes]

I could manually add many rules to keep only the nodes I actually use downstream (a parse with tags and tokens, without "oh, look there" links or "there must have been a node here" markers, just like the output of the Stanford parser). In practice, though, I usually end up either leaving some of these service nodes in place or creating huge gaps and "cropped branches": if you remove the -NONE- nodes above, for example, you are left with an SBAR that has no children at all, which is weird.

I wonder if I can remove everything except the actual parse (words, tags, punctuation) from the output of from nltk.corpus import ptb; ptb.parsed_sents(), once and for all?


Solution

  • Delete any subtree that only dominates traces. In the following, I iterate over subtrees but actually check their children; this makes it easy to delete an empty subtree by modifying the node that contains it.

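    for sub in some_tree.subtrees():
        # Walk the children right to left so that deleting one
        # does not shift the indices of those still to be checked.
        for n in reversed(range(len(sub))):
            child = sub[n]
            if isinstance(child, str):
                continue  # A leaf (token), not a subtree
            if all(leaf.startswith("*") for leaf in child.leaves()):
                del sub[n]  # Delete this child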
    

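    I used leaf.startswith("*") as a simple criterion to detect traces. It misses null elements that do not begin with an asterisk, such as the empty complementizer 0, so replace it with your own test as necessary.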

    Edit: Since you want to delete all nodes containing only subtrees labeled -NONE-, and each such subtree dominates exactly one leaf, use the following test in place of the leaf-based check:

        if len(list(child.subtrees(filter=lambda x: x.label() == '-NONE-'))) == len(child.leaves()):
            del sub[n]
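For reference, here is a minimal sketch of the same idea written as a recursive function that builds a cleaned copy instead of editing the tree in place. It is not part of the original answer: the function name strip_empty_elements and the example sentence are made up for illustration, and it assumes plain nltk.tree.Tree objects whose null elements are all labeled -NONE-.

```python
from nltk.tree import Tree


def strip_empty_elements(tree):
    """Return a copy of `tree` without -NONE- nodes and without the
    constituents that are left empty once those nodes are gone.

    Hypothetical helper for illustration; assumes a plain nltk Tree.
    """
    kept = []
    for child in tree:
        if isinstance(child, str):
            kept.append(child)        # ordinary token, keep as is
        elif child.label() == "-NONE-":
            continue                  # trace or other null element
        else:
            pruned = strip_empty_elements(child)
            if len(pruned) > 0:       # drop constituents that became empty
                kept.append(pruned)
    return Tree(tree.label(), kept)


# Made-up example in PTB style: the object NP dominates only a trace.
example = Tree.fromstring(
    "(SBARQ (WHNP-1 (WP What)) "
    "(SQ (VBD did) (NP-SBJ (PRP he)) (VP (VB see) (NP (-NONE- *T*-1)))) "
    "(. ?))"
)
cleaned = strip_empty_elements(example)
print(cleaned.leaves())
```

Because the function returns a new tree, the original parse stays untouched, which can be handy if you still need the traces elsewhere. To apply it to the corpus, something like [strip_empty_elements(t) for t in ptb.parsed_sents()] should work, provided the reader returns ordinary Tree objects.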