
How can I pretty print a nltk tree object?


I want to check visually whether the result below is what I need:

import nltk 
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

pattern = """NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}
IN: {<IN>}"""
NPChunker = nltk.RegexpParser(pattern) 
result = NPChunker.parse(sentence)

source: https://stackoverflow.com/a/31937278/3552975

I don't know why I cannot pretty_print the result.

result.pretty_print()

The error is TypeError: not all arguments converted during string formatting. I use Python 3.5 and nltk 3.3.


Solution

  • If you're looking for a bracketed parse output, you can use Tree.pprint():

    >>> import nltk 
    >>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    >>> 
    >>> pattern = """NP: {<DT>?<JJ>*<NN>}
    ... VBD: {<VBD>}
    ... IN: {<IN>}"""
    >>> NPChunker = nltk.RegexpParser(pattern) 
    >>> result = NPChunker.parse(sentence)
    >>> result.pprint()
    (S
      (NP the/DT little/JJ yellow/JJ dog/NN)
      (VBD barked/VBD)
      (IN at/IN)
      (NP the/DT cat/NN))
    

    But most probably you're looking for a tree diagram like this:

                                 S                                      
                _________________|_____________________________          
               NP                        VBD       IN          NP       
       ________|_________________         |        |      _____|____     
    the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN
    

    Let's dig into the code of Tree.pretty_print() https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L692 :

    def pretty_print(self, sentence=None, highlight=(), stream=None, **kwargs):
        """
        Pretty-print this tree as ASCII or Unicode art.
        For explanation of the arguments, see the documentation for
        `nltk.treeprettyprinter.TreePrettyPrinter`.
        """
        from nltk.treeprettyprinter import TreePrettyPrinter
        print(TreePrettyPrinter(self, sentence, highlight).text(**kwargs),
              file=stream)
    

    It's creating a TreePrettyPrinter object, https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L50

    class TreePrettyPrinter(object):
        def __init__(self, tree, sentence=None, highlight=()):
            if sentence is None:
                leaves = tree.leaves()
                if (leaves and not any(len(a) == 0 for a in tree.subtrees())
                        and all(isinstance(a, int) for a in leaves)):
                    sentence = [str(a) for a in leaves]
                else:
                    # this deals with empty nodes (frontier non-terminals)
                    # and multiple/mixed terminals under non-terminals.
                    tree = tree.copy(True)
                    sentence = []
                    for a in tree.subtrees():
                        if len(a) == 0:
                            a.append(len(sentence))
                            sentence.append(None)
                        elif any(not isinstance(b, Tree) for b in a):
                            for n, b in enumerate(a):
                                if not isinstance(b, Tree):
                                    a[n] = len(sentence)
                                    sentence.append('%s' % b)
            self.nodes, self.coords, self.edges, self.highlight = self.nodecoords(
                    tree, sentence, highlight)
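
    The else branch above swaps every leaf for an integer index into a separate sentence list. A simplified sketch of that idea, using plain nested lists as a stand-in for Tree (the index_leaves helper is hypothetical, visits leaves depth-first rather than in nltk's subtree order, and uses str() instead of '%s' %):

```python
def index_leaves(node, sentence):
    """Replace each non-list leaf in a nested list with its index in sentence."""
    for n, child in enumerate(node):
        if isinstance(child, list):
            # Inner list: stands in for a subtree, so recurse into it.
            index_leaves(child, sentence)
        else:
            # Leaf: store its text in sentence and keep only the index.
            node[n] = len(sentence)
            sentence.append(str(child))
    return node


tree = [["the", "dog"], ["barked"]]
sentence = []
index_leaves(tree, sentence)
print(tree)      # [[0, 1], [2]]
print(sentence)  # ['the', 'dog', 'barked']
```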
    

    And it looks like the line raising the error is sentence.append('%s' % b) https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L97

    The question is: why does it raise a TypeError?

    TypeError: not all arguments converted during string formatting
    

    If we look carefully, it looks like print('%s' % b) works for most basic Python types:

    # String
    >>> x = 'abc'
    >>> type(x)
    <class 'str'>
    >>> print('%s' % x)
    abc
    
    # Integer
    >>> x = 123
    >>> type(x)
    <class 'int'>
    >>> print('%s' % x)
    123
    
    # Float 
    >>> x = 1.23
    >>> type(x)
    <class 'float'>
    >>> print('%s' % x)
    1.23
    
    # Boolean
    >>> x = True
    >>> type(x)
    <class 'bool'>
    >>> print('%s' % x)
    True
    

    Surprisingly, it even works on a list!

    >>> x = ['abc', 'def']
    >>> type(x)
    <class 'list'>
    >>> print('%s' % x)
    ['abc', 'def']
    

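    But it fails on a tuple, because the % operator treats a tuple on its right-hand side as the list of arguments, so a two-item tuple supplies two arguments for a single %s: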

    >>> x = ('abc', 'def')
    >>> type(x)
    <class 'tuple'>
    >>> print('%s' % x)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: not all arguments converted during string formatting
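
    A small sketch of the tuple behaviour, plus three standard ways to format a tuple as a single value. The variable names are only for illustration:

```python
x = ('abc', 'def')

# A bare tuple on the right of % is unpacked as the argument list:
# two arguments for one %s placeholder, hence the TypeError.
try:
    '%s' % x
    message = None
except TypeError as err:
    message = str(err)

# Wrapping the value in a one-element tuple passes it as a single argument.
wrapped = '%s' % (x,)

# str() and str.format() never unpack their argument.
via_str = str(x)
via_format = '{}'.format(x)

print(message)
print(wrapped)  # ('abc', 'def')
```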
    

    So if we go back to the code at https://github.com/nltk/nltk/blob/develop/nltk/treeprettyprinter.py#L95

    if not isinstance(b, Tree):
        a[n] = len(sentence)
        sentence.append('%s' % b)
    

    Since sentence.append('%s' % b) can't handle a tuple, we can add a check for the tuple type and join its items into a single str, which produces the nice pretty_print output:

    if not isinstance(b, Tree):
        a[n] = len(sentence)
        if isinstance(b, tuple):
            b = '/'.join(b)
        sentence.append('%s' % b)
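
    The same check can be tried outside nltk. This is a minimal sketch: leaf_to_str is a hypothetical helper, not part of nltk, and it applies str() to each item so that a tuple with non-string items such as ('DT', 123) also joins cleanly.

```python
def leaf_to_str(leaf, sep='/'):
    """Render a leaf as text; join tuple leaves such as ('the', 'DT') with sep."""
    if isinstance(leaf, tuple):
        # map(str, ...) keeps the join from failing on non-string items.
        return sep.join(map(str, leaf))
    return str(leaf)


print(leaf_to_str(('the', 'DT')))  # the/DT
print(leaf_to_str(('DT', 123)))    # DT/123
print(leaf_to_str('dog'))          # dog
```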
    

    [out]:

                                 S                                      
                _________________|_____________________________          
               NP                        VBD       IN          NP       
       ________|_________________         |        |      _____|____     
    the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN
    

    Without changing the nltk code, is it possible to still get the pretty print?

    Let's look at what the result, i.e. a Tree object, looks like:

    Tree('S', [Tree('NP', [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')]), Tree('VBD', [('barked', 'VBD')]), Tree('IN', [('at', 'IN')]), Tree('NP', [('the', 'DT'), ('cat', 'NN')])])
    

    The leaves are kept as a list of tuples of strings, e.g. [('the', 'DT'), ('cat', 'NN')], so we could apply a hack that turns them into a list of strings, e.g. ['the/DT', 'cat/NN'], so that Tree.pretty_print() will play nice.

    Since we know that Tree.pprint() helps us concatenate the tuples of strings into the form we want, i.e.

    (S
      (NP the/DT little/JJ yellow/JJ dog/NN)
      (VBD barked/VBD)
      (IN at/IN)
      (NP the/DT cat/NN))
    

    We can simply output to a bracketed parse string, then re-read the parse Tree object with Tree.fromstring():

    from nltk import Tree
    Tree.fromstring(str(result)).pretty_print()
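
    The string round trip relies on whitespace and brackets to delimit leaves, so a token that itself contains a space or a parenthesis would likely not survive it. A possible alternative, sketched here under that assumption, is to convert the leaves on a copy of the tree. stringify_leaves is a hypothetical helper; Tree.copy, Tree.treepositions and nltk.tag.util.tuple2str are existing nltk calls.

```python
from nltk import RegexpParser, Tree
from nltk.tag.util import tuple2str

sentence = [("the", "DT"), ("dog", "NN"), ("barked", "VBD")]
pattern = """NP: {<DT>?<JJ>*<NN>}
VBD: {<VBD>}"""
result = RegexpParser(pattern).parse(sentence)


def stringify_leaves(tree):
    """Return a deep copy of tree whose (word, tag) leaves are 'word/tag' strings."""
    flat = tree.copy(deep=True)
    for pos in flat.treepositions('leaves'):
        leaf = flat[pos]
        if isinstance(leaf, tuple):
            # tuple2str(('the', 'DT')) gives 'the/DT'.
            flat[pos] = tuple2str(leaf)
    return flat


flat = stringify_leaves(result)
flat.pretty_print()
```

    The original result keeps its tuple leaves, so it can still be used wherever (word, tag) pairs are expected.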
    

    Finally:

    import nltk
    from nltk import Tree
    sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked","VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    
    pattern = """NP: {<DT>?<JJ>*<NN>}
    VBD: {<VBD>}
    IN: {<IN>}"""
    NPChunker = nltk.RegexpParser(pattern) 
    result = NPChunker.parse(sentence)
    
    Tree.fromstring(str(result)).pretty_print()
    

    [out]:

                                 S                                      
                _________________|_____________________________          
               NP                        VBD       IN          NP       
       ________|_________________         |        |      _____|____     
    the/DT little/JJ yellow/JJ dog/NN barked/VBD at/IN the/DT     cat/NN