I'm trying to use the Hunpos tagger for POS tagging with NLTK instead of the traditional pos_tag(), but I'm having trouble loading the binary english.model or en_wsj.model.
I'm on Linux Mint. I put the files in /usr/local/bin, set the HUNPOS environment variable to this path, and even tried passing this path to the path_to_bin parameter of __init__ in nltk/tag/hunpos.py. The file is found, but then this error is thrown:
>>> ht = HunposTagger('en_wsj.model','/usr/local/bin/en_wsj.model')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.4-py2.7.egg/nltk/tag/hunpos.py", line 89, in __init__
    shell=False, stdin=PIPE, stdout=PIPE, stderr=PIPE)
  File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
    raise child_exception
OSError: [Errno 8] Exec format error
Does anyone have an idea of what is happening?
I think I found a way to do it. For anyone having the same problem, I recommend downloading the source code, building it, and calling it in a way different from what is described in the NLTK docs. As it wasn't trivial for me, I'm putting it here step by step:
Under Unix:
1) Install Subversion (SVN) if you don't have it, and check out the project source code:
svn checkout http://hunpos.googlecode.com/svn/trunk/ hunpos-read-only
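This creates a hunpos-read-only directory in the directory where you ran the command (or a trunk directory if you omit the last argument). The steps below refer to it as trunk.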
2) To build it successfully, you might need ocamlbuild, which automates compiling Objective Caml (OCaml) code. sudo apt-get install ocaml-nox should handle this.
3) cd to the trunk directory (where you downloaded the Hunpos source code) and run
./build.sh build
4) At this point, you should have a binary file tagger.native in your trunk directory. Put the whole trunk directory in /usr/local/bin (you may need to do this as superuser).
5) Download the en_wsj.model.gz file here, unzip it, and put the en_wsj.model binary in /usr/local/bin as well.
6) Finally, in your Python script, you can create an instance of the HunposTagger class, passing the paths to both files you created previously, with something very close to:
>>> from nltk.tag.hunpos import HunposTagger
>>> ht = HunposTagger(path_to_model='/usr/local/bin/en_wsj.model',
...                   path_to_bin='/usr/local/bin/trunk/tagger.native')
>>> ht.tag('I want to go to San Francisco next year'.split())
[('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('go', 'VB'), ('to', 'TO'),
('San', 'NNP'), ('Francisco', 'NNP'), ('next', 'JJ'), ('year', 'NN')]
>>> ht.close()
(Don't forget to close the tagger. If you'd rather not call close() yourself, you can use the with statement instead.)
7) If you still have trouble, try setting an environment variable HUNPOS to /usr/local/bin/trunk. To do this, add the following line to your ~/.bashrc (or ~/.bash_profile on macOS):
export HUNPOS=/usr/local/bin/trunk
and restart your terminal.
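To tie steps 6 and 7 together, here is a minimal sketch that builds the binary path from the HUNPOS variable and uses the with statement so the tagger is closed automatically. The helper names resolve_hunpos_bin and tag_tokens are made up for illustration (they are not part of NLTK), and the default paths are simply the ones used in the steps above, so adjust them to your setup.

```python
import os


def resolve_hunpos_bin(default_dir="/usr/local/bin/trunk", name="tagger.native"):
    """Hypothetical helper: build the binary path from $HUNPOS,
    falling back to default_dir when the variable is not set."""
    base = os.environ.get("HUNPOS", default_dir)
    return os.path.join(base, name)


def tag_tokens(tokens, path_to_model="/usr/local/bin/en_wsj.model"):
    """Hypothetical helper: tag a list of tokens and close the tagger
    automatically by using it as a context manager."""
    # Imported lazily so the path helper above works without NLTK installed.
    from nltk.tag.hunpos import HunposTagger

    with HunposTagger(path_to_model=path_to_model,
                      path_to_bin=resolve_hunpos_bin()) as ht:
        return ht.tag(tokens)
```

With the model and binary in place, tag_tokens('I want to go to San Francisco next year'.split()) should behave like the interactive session in step 6, without the explicit close().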
That worked for me, but if someone has a better, shorter, or simpler way to set this up, I'd love to hear it :)
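As for the original error: Exec format error is what the OS reports when it is asked to run a file that is not a program. In the question's call, it looks like the model file was passed as the second argument (path_to_bin), so the model itself was being executed. A small pre-flight check can catch this kind of mix-up before creating the tagger. This is only a sketch: check_hunpos_paths is a hypothetical helper, not an NLTK function, and it assumes Linux, where native binaries start with the ELF magic bytes.

```python
import os

ELF_MAGIC = b"\x7fELF"  # first four bytes of a native Linux (ELF) executable


def looks_like_elf(path):
    """Return True if path is a regular file that starts with the ELF magic."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return f.read(4) == ELF_MAGIC


def check_hunpos_paths(path_to_model, path_to_bin):
    """Hypothetical pre-flight check: return a list of problems (empty if none)."""
    problems = []
    if not os.path.isfile(path_to_model):
        problems.append("model file not found: %s" % path_to_model)
    if not looks_like_elf(path_to_bin):
        problems.append("not an ELF binary: %s" % path_to_bin)
    return problems
```

Calling check_hunpos_paths with the model path given for both arguments, as in the question, reports the second path as not being a binary, whereas passing the tagger.native path from step 4 as path_to_bin should return an empty list.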