I'm playing around with NLTK. When I try to use the chunk module, I enter:
import nltk as nk
Sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
tokens = nk.word_tokenize(Sentence)
tagged = nk.pos_tag(tokens)
entities = nk.chunk.ne_chunk(tagged)
The code runs fine, but when I type
>>> entities
I get the tree output followed by this error message:
Out[2]: Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
Traceback (most recent call last):
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\IPython\core\formatters.py", line 343, in __call__
return method()
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\tree.py", line 726, in _repr_png_
subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 602, in find_binary
binary_names, url, verbose))
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 596, in find_binary_iter
url, verbose):
File "C:\Users\QP19\AppData\Local\Continuum\Anaconda2\lib\site-packages\nltk\internals.py", line 567, in find_file_iter
raise LookupError('\n\n%s\n%s\n%s' % (div, msg, div))
LookupError:
===========================================================================
NLTK was unable to find the gs file!
Use software specific configuration paramaters or set the PATH environment variable.
===========================================================================
According to this post, the solution is to install Ghostscript, since the chunker is trying to use it to display a parse tree, and is looking for one of 3 binaries:
file_names=['gs', 'gswin32c.exe', 'gswin64c.exe']
to use. But even though I installed Ghostscript and I can now find the binary in a Windows search, I am still getting the same error.
What do I need to fix or update?
Additional path information:
import os; print os.environ['PATH']
Returns:
C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;C:\Program Files (x86)\Parallels\Parallels Tools\Applications;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Oracle\RPAS14.1\RpasServer\bin;C:\Oracle\RPAS14.1\RpasServer\applib;C:\Program Files (x86)\Java\jre7\bin;C:\Program Files (x86)\Java\jre7\bin\client;C:\Program Files (x86)\Java\jre7\lib;C:\Program Files (x86)\Java\jre7\jre\bin\client;C:\Users\QP19\AppData\Local\Continuum\Anaconda2;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Scripts;C:\Users\QP19\AppData\Local\Continuum\Anaconda2\Library\bin;
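A quick way to check, from within the same Python session, whether any PATH entry actually contains one of the binaries NLTK looks for (a small diagnostic sketch; it only reports matches, so empty output means Ghostscript is invisible to Python):
import os
gs_names = ['gs', 'gswin32c.exe', 'gswin64c.exe']
for d in os.environ['PATH'].split(os.pathsep):
    # report any PATH directory that actually holds a Ghostscript binary
    hits = [n for n in gs_names if os.path.isfile(os.path.join(d, n))]
    if hits:
        print d, hits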
In short:
Instead of >>> entities, do this:
>>> print entities.__repr__()
Or:
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs
>>> sent = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
>>> entities
In long:
The problem lies in how you display the output of ne_chunk
: in IPython, echoing the result fires Ghostscript to produce a drawing of the NE-tagged sentence, which is an nltk.tree.Tree
object, and rendering that drawing requires the Ghostscript binary.
Let's walk through this step by step.
First, when you use ne_chunk
, you can import it directly from the top level as such:
from nltk import ne_chunk
And it's advisable to use namespaces for your imports, i.e.:
from nltk import word_tokenize, pos_tag, ne_chunk
And when you use ne_chunk
, it comes from https://github.com/nltk/nltk/blob/develop/nltk/chunk/__init__.py
It's not obvious from that function alone what the pickle it loads actually is, but after some inspection we find that there's only one built-in NE chunker that isn't rule-based, and since the name of the pickle file contains maxent, we can assume it's a statistical chunker. So it most probably comes from the NEChunkParser
object in https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py . That module also contains the ACE data API functions, which matches the name of the pickle file.
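In other words, ne_chunk is essentially a thin wrapper that loads that pickled maxent chunker and hands your POS-tagged tokens to its parse() method. A minimal sketch of that behaviour (simplified, not the exact NLTK source; the pickle path is the one used by the multiclass chunker in recent NLTK versions):
from nltk.data import load

# Sketch of what nltk.chunk.ne_chunk roughly does under the hood:
# load the pickled maxent NE chunker, then call its parse() method.
MULTICLASS_NE_CHUNKER = 'chunkers/maxent_ne_chunker/english_ace_multiclass.pickle'

def ne_chunk_sketch(tagged_tokens):
    chunker = load(MULTICLASS_NE_CHUNKER)  # an NEChunkParser instance
    return chunker.parse(tagged_tokens)    # returns an nltk.tree.Tree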
Now, whenever you call the ne_chunk
function, it's actually calling the
NEChunkParser.parse()
function that returns a nltk.tree.Tree
object: https://github.com/nltk/nltk/blob/develop/nltk/chunk/named_entity.py#L118
class NEChunkParser(ChunkParserI):
"""
Expected input: list of pos-tagged words
"""
def __init__(self, train):
self._train(train)
def parse(self, tokens):
"""
Each token should be a pos-tagged word
"""
tagged = self._tagger.tag(tokens)
tree = self._tagged_to_parse(tagged)
return tree
def _train(self, corpus):
# Convert to tagged sequence
corpus = [self._parse_to_tagged(s) for s in corpus]
self._tagger = NEChunkParserTagger(train=corpus)
def _tagged_to_parse(self, tagged_tokens):
"""
Convert a list of tagged tokens to a chunk-parse tree.
"""
sent = Tree('S', [])
for (tok,tag) in tagged_tokens:
if tag == 'O':
sent.append(tok)
elif tag.startswith('B-'):
sent.append(Tree(tag[2:], [tok]))
elif tag.startswith('I-'):
if (sent and isinstance(sent[-1], Tree) and
sent[-1].label() == tag[2:]):
sent[-1].append(tok)
else:
sent.append(Tree(tag[2:], [tok]))
return sent
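The Tree that _tagged_to_parse builds is an ordinary nltk.tree.Tree, so you can inspect it directly. A small illustration in the plain CPython interpreter (where echoing a Tree just prints its repr and no Ghostscript is involved):
>>> from nltk.tree import Tree
>>> # a hand-built example of the kind of tree the chunker returns
>>> sent = Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), ('bought', 'VBD')])
>>> sent.label()
'S'
>>> sent[0]
Tree('PERSON', [('Betty', 'NNP')])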
If we take a look at the nltk.tree.Tree
object, that's where the Ghostscript problem appears, namely when IPython tries to call its _repr_png_
function: https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L702:
def _repr_png_(self):
"""
Draws and outputs in PNG for ipython.
PNG is used instead of PDF, since it can be displayed in the qt console and
has wider browser support.
"""
import os
import base64
import subprocess
import tempfile
from nltk.draw.tree import tree_to_treesegment
from nltk.draw.util import CanvasFrame
from nltk.internals import find_binary
_canvas_frame = CanvasFrame()
widget = tree_to_treesegment(_canvas_frame.canvas(), self)
_canvas_frame.add_widget(widget)
x, y, w, h = widget.bbox()
# print_to_file uses scrollregion to set the width and height of the pdf.
_canvas_frame.canvas()['scrollregion'] = (0, 0, w, h)
with tempfile.NamedTemporaryFile() as file:
in_path = '{0:}.ps'.format(file.name)
out_path = '{0:}.png'.format(file.name)
_canvas_frame.print_to_file(in_path)
_canvas_frame.destroy_widget(widget)
subprocess.call([find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'], env_vars=['PATH'], verbose=False)] +
'-q -dEPSCrop -sDEVICE=png16m -r90 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dSAFER -dBATCH -dNOPAUSE -sOutputFile={0:} {1:}'
.format(out_path, in_path).split())
with open(out_path, 'rb') as sr:
res = sr.read()
os.remove(in_path)
os.remove(out_path)
return base64.b64encode(res).decode()
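The find_binary call inside that method is exactly where your LookupError originates, and you can reproduce it in isolation with the same arguments; this is also a convenient way to test whether a PATH fix worked without rendering a whole tree:
>>> from nltk.internals import find_binary
>>> # the same lookup _repr_png_ performs: raises the LookupError above if no
>>> # Ghostscript binary is on PATH, otherwise returns the path to gs
>>> find_binary('gs', binary_names=['gswin32c.exe', 'gswin64c.exe'],
...             env_vars=['PATH'], verbose=False)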
But note that it's strange that the Python interpreter would fire _repr_png_
instead of __repr__
when you type >>> entities
at the prompt (see Purpose of Python's __repr__). That isn't how the native CPython interpreter works when displaying the representation of an object, so we take a look at IPython.core.formatters
and we see that it allows _repr_png_
to be fired at https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L725:
class PNGFormatter(BaseFormatter):
"""A PNG formatter.
To define the callables that compute the PNG representation of your
objects, define a :meth:`_repr_png_` method or use the :meth:`for_type`
or :meth:`for_type_by_name` methods to register functions that handle
this.
The return value of this formatter should be raw PNG data, *not*
base64 encoded.
"""
format_type = Unicode('image/png')
print_method = ObjectName('_repr_png_')
_return_type = (bytes, unicode_type)
And we see that when IPython initializes a DisplayFormatter
object, it tries to activate all formatters: https://github.com/ipython/ipython/blob/master/IPython/core/formatters.py#L66
def _formatters_default(self):
"""Activate the default formatters."""
formatter_classes = [
PlainTextFormatter,
HTMLFormatter,
MarkdownFormatter,
SVGFormatter,
PNGFormatter,
PDFFormatter,
JPEGFormatter,
LatexFormatter,
JSONFormatter,
JavascriptFormatter
]
d = {}
for cls in formatter_classes:
f = cls(parent=self)
d[f.format_type] = f
return d
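As a side note, because IPython keeps these formatters in a registry keyed by MIME type, another way to stop the PNG formatter from firing at all is to disable it for the session. A sketch using IPython's formatter registry (this assumes you run it inside IPython, where get_ipython() is available):
# run inside IPython: turn off the image/png formatter so that echoing a
# Tree falls back to the plain-text repr instead of calling _repr_png_
ip = get_ipython()
ip.display_formatter.formatters['image/png'].enabled = False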
Note that outside of IPython
, in the native CPython interpreter, it will only call the __repr__
and not the _repr_png_
:
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> sentence = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sentence)))
>>> entities
Tree('S', [Tree('PERSON', [('Betty', 'NNP')]), Tree('PERSON', [('Botter', 'NNP')]), ('bought', 'VBD'), ('some', 'DT'), ('butter', 'NN'), (',', ','), ('but', 'CC'), ('she', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('butter', 'NN'), ('is', 'VBZ'), ('bitter', 'JJ'), (',', ','), ('I', 'PRP'), ('f', 'VBP'), ('I', 'PRP'), ('put', 'VBD'), ('it', 'PRP'), ('in', 'IN'), ('my', 'PRP$'), ('batter', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('make', 'VB'), ('my', 'PRP$'), ('batter', 'NN'), ('bitter', 'NN'), ('.', '.')])
So now the solution:
Solution 1:
When printing out the string output of ne_chunk
, you can use
>>> print entities.__repr__()
instead of >>> entities
; that way, IPython explicitly calls only the __repr__
instead of calling all possible formatters.
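Relatedly, plain print writes straight to stdout and bypasses IPython's display formatters entirely, and the Tree's string form needs no Ghostscript, so this should also sidestep the lookup:
>>> # print goes straight to stdout, so no _repr_png_ is triggered
>>> print entities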
Solution 2
If you really need to use the _repr_png_
to visualize the Tree object, then we need to figure out how to make the Ghostscript binary visible through the environment variables NLTK searches.
In your case, it seems like the default nltk.internals
are unable to find the binary. More specifically, we're referring to https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L599
If we go back to https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L726, we see that it's trying to look up the binary through
env_vars=['PATH']
And when NLTK tries to initialize its environment variables, it is looking at os.environ
, see https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L495
Note that find_binary
calls find_binary_iter
which calls find_file_iter
that tries to look for the env_vars
by fetching os.environ
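Conceptually, that chain boils down to walking each directory in os.environ['PATH'] and checking it for one of the candidate binary names. A simplified sketch of that behaviour (not the actual NLTK source):
import os

def find_gs_on_path(binary_names=('gs', 'gswin32c.exe', 'gswin64c.exe')):
    # walk every PATH entry and return the first Ghostscript binary found,
    # mirroring in spirit what find_file_iter does with env_vars=['PATH']
    for directory in os.environ.get('PATH', '').split(os.pathsep):
        for name in binary_names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    raise LookupError('NLTK was unable to find the gs file!')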
So if we add the Ghostscript bin directory to the path:
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs
Now this should work in IPython:
>>> import os
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> path_to_gs = r"C:\Program Files\gs\gs9.19\bin"
>>> os.environ['PATH'] += os.pathsep + path_to_gs
>>> sent = "Betty Botter bought some butter, but she said the butter is bitter, I f I put it in my batter, it will make my batter bitter."
>>> entities = ne_chunk(pos_tag(word_tokenize(sent)))
>>> entities