
How to correctly set up the Hunpos tagger in NLTK for POS tagging in English?


I'm trying to use the Hunpos tagger for POS tagging with NLTK instead of the usual pos_tag(), but I'm having trouble loading the binary model file (english.model or en_wsj.model).

I'm on Linux Mint. I put the model files in /usr/local/bin, set the HUNPOS environment variable to that path, and even tried passing the path as the path_to_bin parameter of __init__ in nltk/tag/hunpos.py. The file is found, but then this error is thrown:

>>> ht = HunposTagger('en_wsj.model','/usr/local/bin/en_wsj.model')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.4-py2.7.egg/nltk/tag/hunpos.py", line 89, in __init__
    shell=False, stdin=PIPE, stdout=PIPE, stderr=PIPE)
  File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
    raise child_exception
OSError: [Errno 8] Exec format error

Does anyone have an idea what is happening?


Solution

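  • I think I found a way to do it. The Exec format error comes from passing the model file as path_to_bin: NLTK tries to execute whatever that path points to, and a model file is not an executable. For anyone with the same problem, I recommend downloading the source code, building it, and calling it differently from what the NLTK docs describe. As it wasn't trivial for me, here it is step by step: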

    Under Unix:

    1) Install Subversion (svn) if you don't have it, then check out the project source code:

    svn checkout http://hunpos.googlecode.com/svn/trunk/ hunpos-read-only
    

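    This creates a hunpos-read-only directory holding the contents of trunk. The steps below call it the trunk directory, so rename it to trunk if you want the paths to match.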

    2) Then, to build it successfully, you may need ocamlbuild, the build tool for OCaml (Objective Caml). sudo apt-get install ocaml-nox should take care of this.

    3) cd to the trunk directory (where you checked out the Hunpos source code) and run:

    ./build.sh build
    

    4) At this point, you should have a binary file tagger.native in your trunk directory. Put the whole trunk directory in /usr/local/bin (you may need to do this as superuser).

    5) Download the en_wsj.model.gz file here, decompress it, and put the resulting en_wsj.model binary in /usr/local/bin as well.

    6) Finally, in your Python script, create an instance of the HunposTagger class, passing the paths to both files from the previous steps, along these lines:

    >>> from nltk.tag.hunpos import HunposTagger
    >>> ht = HunposTagger(path_to_model='/usr/local/bin/en_wsj.model',
    ...                   path_to_bin='/usr/local/bin/trunk/tagger.native')
    >>> ht.tag('I want to go to San Francisco next year'.split())
    [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('go', 'VB'), ('to', 'TO'),
     ('San', 'NNP'), ('Francisco', 'NNP'), ('next', 'JJ'), ('year', 'NN')]
    >>> ht.close()
    

    (Don't forget to call close(). If you'd rather not close it manually, you can use the with statement instead.)

    7) If you still have trouble, try setting the environment variable HUNPOS to /usr/local/bin/trunk. To do this, add the following line to your ~/.bashrc (or ~/.bash_profile on macOS):

    export HUNPOS=/usr/local/bin/trunk
    

    and restart your terminal.
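To confirm that Python actually sees the variable after the restart, a small check along these lines may help. This is only a sketch: find_hunpos_binary and its environ and name parameters are names made up for illustration, not part of NLTK, and it assumes the binary is still called tagger.native as in step 4.

```python
import os


def find_hunpos_binary(environ=None, name="tagger.native"):
    """Return the path of the tagger binary under $HUNPOS, or None.

    Hypothetical helper for illustration; not an NLTK function.
    """
    environ = os.environ if environ is None else environ
    base = environ.get("HUNPOS")
    if not base:
        return None  # HUNPOS is not set in this process
    candidate = os.path.join(base, name)
    # The binary must be a regular file with the execute bit set.
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None
```

Calling find_hunpos_binary() with no arguments reads the real environment. A None result means HUNPOS is unset in that process, or the file is missing or not executable.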

    That worked for me, but if someone has a better, shorter, or simpler way to set this up, I'd love to hear it :)
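For the with-statement variant mentioned in step 6, something like the following should work, assuming your NLTK version's HunposTagger supports the context-manager protocol. The tag_tokens wrapper and its tagger_class parameter are hypothetical names added here for illustration (the parameter just lets you try the helper without Hunpos installed); only HunposTagger, path_to_model, path_to_bin and tag() come from the steps above.

```python
def tag_tokens(tokens, path_to_model, path_to_bin, tagger_class=None):
    """Tag a list of tokens and shut the Hunpos subprocess down afterwards.

    Hypothetical wrapper for illustration; not part of NLTK.
    """
    if tagger_class is None:
        # Imported lazily so the helper can be loaded without NLTK installed.
        from nltk.tag.hunpos import HunposTagger
        tagger_class = HunposTagger
    # Leaving the with-block invokes the tagger's __exit__, which is where
    # the subprocess gets closed, so no explicit close() call is needed.
    with tagger_class(path_to_model=path_to_model,
                      path_to_bin=path_to_bin) as tagger:
        return tagger.tag(tokens)
```

With the paths from the steps above, the call would look like tag_tokens('I want to go to San Francisco next year'.split(), '/usr/local/bin/en_wsj.model', '/usr/local/bin/trunk/tagger.native').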