python nlp nltk depth-first-search chunking

Is there a direct way to get the chunking result from an NLTK Tree?


I want to chunk a sentence and get the groups at a certain height of the tree. The original word order should be kept, and the result should contain all of the original words.

import nltk 
height = 2
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

pattern = """NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}
IN: {<IN>}"""
NPChunker = nltk.RegexpParser(pattern) 
result = NPChunker.parse(sentence)

In [29]: nltk.Tree.fromstring(str(result)).pretty_print()
                             S                                      
            _________________|_____________________________          
           NP                        VBD       IN          NP       
   ________|_________________         |        |      _____|____     
the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN

My current approach is rather brute-force:

In [30]: [list(map(lambda x: x[0], _tree.leaves())) for _tree in result.subtrees(lambda x: x.height()==height)]
Out[30]: [['the', 'little', 'yellow', 'dog'], ['barked'], ['at'], ['the', 'cat']]

I thought there would be a direct API for this kind of chunking. Any suggestions are highly appreciated.


Solution

  • Nope, there isn't any built-in function in NLTK that returns the subtrees of a Tree at a certain depth.

    But you can adapt the depth-first traversal from "How to Traverse an NLTK Tree object?"

    To be efficient, you can iterate over the children of the root and keep only the subtrees whose height does not exceed the target, e.g.

    import nltk 
    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    
    pattern = """NP: {<DT>?<JJ>*<NN>}
    VBD: {<VBD>}
    IN: {<IN>}"""
    NPChunker = nltk.RegexpParser(pattern) 
    result = NPChunker.parse(sentence)
    
    def traverse_tree(tree, depth=float('inf')):
        """
        Iterate over the children of `tree` in order and yield the
        leaves of each subtree whose height is at most `depth`.
        Taller subtrees and bare leaves are skipped.
        """
        for subtree in tree:
            # Leaves (tuples or strings) are not Tree instances.
            if isinstance(subtree, nltk.Tree) and subtree.height() <= depth:
                yield subtree.leaves()
    
    
    list(traverse_tree(result, 2))
    

    [out]:

    [[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')],
     [('barked', 'VBD')],
     [('at', 'IN')],
     [('the', 'DT'), ('cat', 'NN')]]
    

    Another example, where the first NP has height 3 and is therefore skipped:

    x = """(S
      (NP the/DT 
          (AP little/JJ yellow/JJ)
           dog/NN)
      (VBD barked/VBD)
      (IN at/IN)
      (NP the/DT cat/NN))"""
    
    list(traverse_tree(nltk.Tree.fromstring(x), 2))
    

    [out]:

    [['barked/VBD'], ['at/IN'], ['the/DT', 'cat/NN']]
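
Since the question asks that every original word be kept, the nested example above falls short: the words under the taller NP never reach the output. One possible variant, sketched below, descends into subtrees that are too tall and emits any bare leaf it meets on the way as a one-token chunk. The name `chunks_at_height` and the choice to turn stray leaves into singleton chunks are assumptions made for illustration, not part of NLTK or of the answer above.

```python
import nltk


def chunks_at_height(tree, height=2):
    """Yield groups of leaves in sentence order without dropping any token.

    A subtree no taller than `height` becomes one chunk. A taller subtree
    is descended into. A bare leaf met above the target height becomes a
    one-token chunk (an illustrative choice, not NLTK behaviour).
    """
    for child in tree:
        if not isinstance(child, nltk.Tree):
            yield [child]                                # stray leaf
        elif child.height() <= height:
            yield child.leaves()                         # whole subtree is one chunk
        else:
            yield from chunks_at_height(child, height)   # too tall: go deeper


nested = nltk.Tree.fromstring("""(S
  (NP the/DT
      (AP little/JJ yellow/JJ)
      dog/NN)
  (VBD barked/VBD)
  (IN at/IN)
  (NP the/DT cat/NN))""")

chunks = list(chunks_at_height(nested, 2))
print(chunks)
# [['the/DT'], ['little/JJ', 'yellow/JJ'], ['dog/NN'],
#  ['barked/VBD'], ['at/IN'], ['the/DT', 'cat/NN']]
```

Flattening `chunks` gives back `nested.leaves()` in the original order, which is a quick way to check that nothing was lost. If splitting the tall NP into three pieces is not what you want, the alternative is to pick a larger `height` so that the whole NP counts as one chunk.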