How to get rid of -NONE- and *T*-i in PTB parse trees using NLTK?

I process Penn Treebank v2 trees and often encounter "service" subtrees like these (and several other types):

I can manually add a lot of rules to keep only the nodes I actually use further (a parse with tags and tokens, without "oh, look there" links or "there must have been a node here" placeholders, just like the trees returned by the Stanford parser). But I usually end up either leaving some of these service nodes behind or creating huge gaps and "cropped branches": for example, if you remove the -NONE- nodes above, you are left with an SBAR that has no children at all, which is weird.

I wonder if I can remove everything except the actual parse (words, tags, punctuation) from the output of from nltk.corpus import ptb; ptb.parsed_sents(), once and for all?

Solution

Delete any subtree that only dominates traces. In the following, I iterate over subtrees but actually check their children; this makes it easy to delete an empty subtree by modifying the node that contains it.

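for sub in some_tree.subtrees():
    # Walk the child indices backwards so that deleting a child
    # does not shift the positions of children not yet visited.
    for n in reversed(range(len(sub))):
        child = sub[n]
        if isinstance(child, str):
            continue  # A leaf token, not a subtree
        if all(leaf.startswith("*") for leaf in child.leaves()):
            del sub[n]  # Delete this child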

I used leaf.startswith("*") as a simple criterion to detect traces. Replace it with your own as necessary.

Edit: Since you want to delete all nodes containing only subtrees labeled -NONE-, and each such subtree dominates exactly one leaf, use the following test:

    if len(list(child.subtrees(filter=lambda x: x.label() == '-NONE-'))) == len(child.leaves()):
        del sub[n]
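
Putting the loop and the -NONE- test together, a minimal self-contained sketch could look like the following. The function name strip_empty_elements and the sample sentence are made up for illustration; the tree is built with Tree.fromstring so the snippet runs without the corpus files installed. It walks the child indices in reverse, on the assumption that deleting while counting upwards could skip the sibling that follows a deleted child.

```python
from nltk.tree import Tree


def strip_empty_elements(tree):
    """Delete, in place, every subtree that dominates only -NONE- leaves.

    Hypothetical helper name; returns the same (modified) tree for convenience.
    """
    for sub in tree.subtrees():
        # Go through the child positions backwards so a deletion never
        # shifts the index of a child we still have to look at.
        for n in reversed(range(len(sub))):
            child = sub[n]
            if isinstance(child, str):
                continue  # a leaf token, nothing to prune
            none_nodes = list(child.subtrees(filter=lambda t: t.label() == "-NONE-"))
            # Each -NONE- node dominates exactly one leaf, so equal counts
            # mean that every leaf under `child` is an empty element.
            if len(none_nodes) == len(child.leaves()):
                del sub[n]
    return tree


# Made-up example: the SBAR contains only empty elements.
tree = Tree.fromstring(
    "(S (NP-SBJ (DT the) (NN man)) "
    "(VP (VBD said) (SBAR (-NONE- 0) (S (-NONE- *T*-1)))) (. .))"
)
strip_empty_elements(tree)
print(tree.leaves())  # ['the', 'man', 'said', '.']
```

Because the SBAR is removed as a whole, no childless node is left behind. Note that this only removes the empty elements themselves; index suffixes on the remaining labels (such as the -1 in WHNP-1) are left untouched.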
