I process Penn Treebank v2 trees and often encounter "service" subtrees like these (and several other types).
I can manually add a lot of rules to reduce the trees to the nodes I actually use later (a parse with tags and tokens only, without "oh, look there" links or "there must have been a node here" placeholders, just like the trees returned by the Stanford parser). But I usually end up leaving some of these service nodes behind, or creating huge gaps and "cropped branches": for example, if you remove the -NONE- nodes above, SBAR is left with no children at all, which is weird.
Is there a way to remove everything except the actual parse (words, tags, punctuation) from the output of ptb.parsed_sents() (after from nltk.corpus import ptb), once and for all?
Delete any subtree that only dominates traces. In the following, I iterate over subtrees but actually check their children; this makes it easy to delete an empty subtree by modifying the node that contains it.
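for sub in some_tree.subtrees():
    # Go through the children from last to first, so that deleting one
    # does not shift the positions of the children still to be checked.
    for n in reversed(range(len(sub))):
        child = sub[n]
        if isinstance(child, str):
            continue  # A leaf (token), not a subtree
        if all(leaf.startswith("*") for leaf in child.leaves()):
            del sub[n]  # Delete this child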
I used leaf.startswith("*") as a simple criterion to detect traces. Replace it with your own as necessary.
Edit: Since you want to delete all nodes containing only subtrees labeled -NONE-, and each such subtree dominates exactly one leaf, use the following test:
if len(list(child.subtrees(filter=lambda x: x.label() == '-NONE-'))) == len(child.leaves()):
    del sub[n]
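Putting the pieces together, here is a minimal sketch of the whole thing as one function. The name strip_traces and the example tree are made up for illustration (the tree is typed by hand, not taken from the corpus), and the children are visited from last to first so that a deletion cannot shift a child that has not been checked yet.

```python
from nltk.tree import Tree


def strip_traces(tree):
    """Delete, in place, every subtree that dominates only -NONE- leaves.

    Hypothetical helper name; the root itself is never deleted.
    """
    for sub in tree.subtrees():
        # Visit children from last to first so deletions do not shift
        # the indices of children that are still to be checked.
        for n in reversed(range(len(sub))):
            child = sub[n]
            if isinstance(child, str):
                continue  # a token, nothing to prune
            none_nodes = list(
                child.subtrees(filter=lambda t: t.label() == "-NONE-")
            )
            # Each -NONE- node dominates exactly one leaf, so equal counts
            # mean that every leaf under this child is a trace.
            if len(none_nodes) == len(child.leaves()):
                del sub[n]
    return tree


# Hand-written example in Penn Treebank bracketing (an assumption, not corpus data).
example = Tree.fromstring(
    "(S (NP-SBJ (-NONE- *)) "
    "(VP (VBD left) (SBAR (-NONE- 0) (S (-NONE- *T*-1)))) (. .))"
)
strip_traces(example)
print(example.leaves())  # ['left', '.']
```

The empty NP-SBJ and the whole SBAR are removed, so no childless node is left behind. To clean the corpus, you would call the function on each tree returned by ptb.parsed_sents(), assuming the ptb corpus is installed locally.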