Tags: python, environment-variables, ipython, nltk, ghostscript

Can't find ghostscript in NLTK?


I'm playing around with NLTK. When I try to use the chunk module:

import nltk as nk
Sentence  = "Betty Botter bought some butter, but she said the butter is  bitter, I f I put it in my batter, it will make my batter bitter."
tokens = nk.word_tokenize(Sentence)
tagged = nk.pos_tag(tokens)
entities = nk.chunk.ne_chunk(tagged) 

The code runs fine. When I type

>>> entities 

I get the following error message:

Out[2]: Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])

Traceback (most recent call last):

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\IPython\core\formatters.py", line 343, in __call__
return method()

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\tree.py", line 726, in _repr_png_
subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 602, in find_binary
binary_names, url, verbose))

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 596, in find_binary_iter
url, verbose):

File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 567, in find_file_iter
raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))

LookupError: 

===========================================================================
NLTK was unable to find the gs file!
Use software specific configuration paramaters or set the PATH environment variable.
===========================================================================

According to this post, the solution is to install Ghostscript, since the chunker is trying to use it to display a parse tree, and is looking for one of 3 binaries:

file_names=['gs', 'gswin32c.exe', 'gswin64c.exe']

to use. But even though I installed Ghostscript, and I can now find the binary in a Windows search, I am still getting the same error.

What do I need to fix or update?


Additional path information:

import os; print os.environ['PATH']

Returns:

C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Program Files (x86)\Parallels\Parallels Tools\Applications;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Oracle\RPAS14.1\RpasServer\bin;C:\Oracle\RPAS14.1\RpasServer\applib;C:\Program Files (x86)\Java\jre7\bin;C:\Program Files (x86)\Java\jre7\bin\client;C:\Program Files (x86)\Java\jre7\lib;C:\Program Files (x86)\Java\jre7\jre\bin\client;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;  

Solution

  • In short:

    Instead of >>> entities, do this:

    >>> print entities.__repr__()
    

    Or:

    >>> import os
    >>> from nltk import word_tokenize, pos_tag, ne_chunk
    >>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
    >>> os.environ['PATH'] += os.pathsep + path_to_gs
    >>> sent = "Betty Botter bought some butter, but she said the butter is  bitter, I f I put it in my batter, it will make my batter bitter."
    >>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
    >>> entities
    

    In long:

    The problem lies in how you print the output of ne_chunk: evaluating it at the prompt fires Ghostscript to build the drawn (PNG) representation of the NE-tagged sentence, which is an nltk.tree.Tree object, and rendering that widget is what requires Ghostscript.

    Let's walk through this step by step.

    First, when you use ne_chunk, you can import it directly from the top level, like this:

    from nltk import ne_chunk
    

    And you can pull in everything you need with a single import, i.e.:

    from nltk import word_tokenize, pos_tag, ne_chunk
    

    And when you call ne_chunk, it comes from https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py

    It's not obvious from that code which chunker the pickle loads, but after some inspection we find that there's only one built-in NE chunker that isn't rule-based, and since the name of the pickle binary says maxent we can assume it's a statistical chunker. So it most probably comes from the NEChunkParser object in https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py . The ACE data API functions there, like the name of the pickle binary, point the same way.
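
    (If you want to check which pickle that is, you can ask nltk.data for it directly; a quick sketch, assuming the stock maxent_ne_chunker resource installed via nltk.download(), whose exact name may vary across NLTK versions:)

    >>> import nltk
    >>> # prints the on-disk location of the multiclass ACE chunker pickle, if it is installed
    >>> print nltk.data.find('chunkers/maxent_ne_chunker/english_ace_multiclass.pickle')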

    Now, whenever you call the ne_chunk function, it's actually calling the NEChunkParser.parse() function, which returns an nltk.tree.Tree object: https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py#L118

    class NEChunkParser(ChunkParserI):
        """
        Expected input: list of pos-tagged words
        """
        def __init__(self, train):
            self._train(train)
    
        def parse(self, tokens):
            """
            Each token should be a pos-tagged word
            """
            tagged = self._tagger.tag(tokens)
            tree = self._tagged_to_parse(tagged)
            return tree
    
        def _train(self, corpus):
            # Convert to tagged sequence
            corpus = [self._parse_to_tagged(s) for s in corpus]
    
            self._tagger = NEChunkParserTagger(train=corpus)
    
        def _tagged_to_parse(self, tagged_tokens):
            """
            Convert a list of tagged tokens to a chunk-parse tree.
            """
            sent = Tree('S', [])
    
            for (tok,tag) in tagged_tokens:
                if tag == 'O':
                    sent.append(tok)
                elif tag.startswith('B-'):
                    sent.append(Tree(tag[2:], [tok]))
                elif tag.startswith('I-'):
                    if (sent and isinstance(sent[-1], Tree) and
                        sent[-1].label() == tag[2:]):
                        sent[-1].append(tok)
                    else:
                        sent.append(Tree(tag[2:], [tok]))
            return sent
    

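    Since _tagged_to_parse returns a plain Tree of subtrees and tuples, you can pull the named entities out of the chunked output directly, without any drawing; a small sketch against the entities tree from the question (its output is read off the Tree repr shown there):

    >>> from nltk.tree import Tree
    >>> # keep only the top-level NE subtrees and join their words
    >>> [(sub.label(), ' '.join(word for word, tag in sub.leaves()))
    ...  for sub in entities if isinstance(sub, Tree)]
    [('PERSON', 'Betty'), ('PERSON', 'Botter')]
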
    If we take a look at the nltk.tree.Tree object, that's where the Ghostscript problem appears, namely when something tries to call its _repr_png_ function (https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L702):

    def _repr_png_(self):
        """
        Draws and outputs in PNG for ipython.
        PNG is used instead of PDF, since it can be displayed in the qt console and
        has wider browser support.
        """
        import os
        import base64
        import subprocess
        import tempfile
        from nltk.draw.tree import tree_to_treesegment
        from nltk.draw.util import CanvasFrame
        from nltk.internals import find_binary
        _canvas_frame = CanvasFrame()
        widget = tree_to_treesegment(_canvas_frame.canvas(), self)
        _canvas_frame.add_widget(widget)
        x, y, w, h = widget.bbox()
        # print_to_file uses scrollregion to set the width and height of the pdf.
        _canvas_frame.canvas()['scrollregion'] = (0, 0, w, h)
        with tempfile.NamedTemporaryFile() as file:
            in_path = '{0:}.ps'.format(file.name)
            out_path = '{0:}.png'.format(file.name)
            _canvas_frame.print_to_file(in_path)
            _canvas_frame.destroy_widget(widget)
            subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
                            '-q -dEPSCrop -sDEVICE=png16m -r90 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dSAFER -dBATCH -dNOPAUSE -sOutputFile={0:} {1:}'
                            .format(out_path, in_path).split())
            with open(out_path, 'rb') as sr:
                res = sr.read()
            os.remove(in_path)
            os.remove(out_path)
            return base64.b64encode(res).decode()
    

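    (Incidentally, that subprocess.call is the only place Ghostscript is needed, so you can sanity-check the install on its own, outside NLTK; a sketch, assuming the default 64-bit Ghostscript 9.19 location:)

    >>> import subprocess
    >>> # exit code 0 and a printed version string mean the binary itself is fine
    >>> subprocess.call([r"C:\Program Files\gs\gs9.19\bin\gswin64c.exe", "--version"])
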
    But note that it's strange that the interpreter would fire _repr_png_ instead of __repr__ when you evaluate >>> entities at the prompt (see Purpose of Python's __repr__). That isn't how the native CPython interpreter prints an object's representation, so we take a look at IPython.core.formatters and see that it allows _repr_png_ to be fired, at https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L725:

    class PNGFormatter(BaseFormatter):
        """A PNG formatter.
        To define the callables that compute the PNG representation of your
        objects, define a :meth:`_repr_png_` method or use the :meth:`for_type`
        or :meth:`for_type_by_name` methods to register functions that handle
        this.
        The return value of this formatter should be raw PNG data, *not*
        base64 encoded.
        """
        format_type = Unicode('image/png')
    
        print_method = ObjectName('_repr_png_')
    
        _return_type = (bytes, unicode_type)
    

    And we see that when IPython initializes a DisplayFormatter object, it tries to activate all formatters: https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L66

    def _formatters_default(self):
        """Activate the default formatters."""
        formatter_classes = [
            PlainTextFormatter,
            HTMLFormatter,
            MarkdownFormatter,
            SVGFormatter,
            PNGFormatter,
            PDFFormatter,
            JPEGFormatter,
            LatexFormatter,
            JSONFormatter,
            JavascriptFormatter
        ]
        d = {}
        for cls in formatter_classes:
            f = cls(parent=self)
            d[f.format_type] = f
        return d
    

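    (A side note: because the formatters are registered per MIME type, you can also simply disable the PNG one for the session; a sketch, assuming a reasonably recent IPython:)

    >>> ip = get_ipython()
    >>> # with the 'image/png' formatter off, >>> entities falls back to the plain text repr
    >>> ip.display_formatter.formatters['image/png'].enabled = False
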
    Note that outside of IPython, in the native CPython interpreter, only __repr__ is called and not _repr_png_:

    >>> from nltk import word_tokenize, pos_tag, ne_chunk
    >>> sentence  = "Betty Botter bought some butter, but she said the butter is  bitter, I f I put it in my batter, it will make my batter bitter."
    >>> entities = ne_chunk(pos_tag(word_tokenize(sentence)))
    >>> entities
    Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
    

    So now the solution:

    Solution 1:

    When printing out the string output of the ne_chunk, you can use

    >>> print entities.__repr__()
    

    instead of >>> entities. That way, IPython explicitly calls only __repr__ instead of trying all possible formatters.

    Solution 2:

    If you really need _repr_png_ to visualize the Tree object, then we need to make the Ghostscript binary visible through the environment variables NLTK searches.

    In your case, it seems like the default nltk.internals are unable to find the binary. More specifically, we're referring to https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L599

    If we go back to https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L726, we see that it's trying to look the binary up through

    env_vars=['PATH']
    

    And when NLTK tries to initialize its environment variables, it looks at os.environ; see https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L495

    Note that find_binary calls find_binary_iter, which in turn calls find_file_iter, which looks up the env_vars by fetching os.environ.

    So if we add the Ghostscript bin directory to the PATH:

    >>> import os
    >>> from nltk import word_tokenize, pos_tag, ne_chunk
    >>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
    >>> os.environ['PATH'] += os.pathsep + path_to_gs
    

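    To double-check that the lookup now succeeds, you can call find_binary yourself with the same arguments tree.py uses (purely a sanity check):

    >>> from nltk.internals import find_binary
    >>> # should return the full path to gswin32c.exe / gswin64c.exe instead of raising LookupError
    >>> find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)
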
    Now this should work in IPython:

    >>> import os
    >>> from nltk import word_tokenize, pos_tag, ne_chunk
    >>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
    >>> os.environ['PATH'] += os.pathsep + path_to_gs
    >>> sent = "Betty Botter bought some butter, but she said the butter is  bitter, I f I put it in my batter, it will make my batter bitter."
    >>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
    >>> entities