python machine-learning nlp nltk pos-tagger

Python NLTK pos_tag not returning the correct part-of-speech tag


Having this:

text = word_tokenize("The quick brown fox jumps over the lazy dog")

And running:

nltk.pos_tag(text)

I get:

[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

This is incorrect. The tags for quick, brown, and lazy in the sentence should be:

('quick', 'JJ'), ('brown', 'JJ'), ('lazy', 'JJ')

Testing this through their online tool gives the same result; quick, brown, and lazy should be adjectives, not nouns.


Solution

    In short:

    NLTK is not perfect. In fact, no model is perfect.

    Note:

    As of NLTK version 3.1, the default pos_tag function no longer uses the old MaxEnt English pickle.

    It now uses the perceptron tagger from @Honnibal's implementation; see nltk.tag.pos_tag:

    >>> import inspect
    >>> from nltk import pos_tag
    >>> print(inspect.getsource(pos_tag))
    def pos_tag(tokens, tagset=None):
        tagger = PerceptronTagger()
        return _pos_tag(tokens, tagset, tagger) 
    

    It's better, but still not perfect:

    >>> from nltk import pos_tag
    >>> pos_tag("The quick brown fox jumps over the lazy dog".split())
    [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
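
Because the default changed at version 3.1, the same pos_tag call can return different tags depending on which NLTK is installed. Here is a minimal sketch for checking which default a given version string implies. The helper name default_tagger_name is made up for illustration; only the 3.1 cut-off comes from the note above.

```python
import re


def default_tagger_name(nltk_version):
    """Guess which tagger nltk.pos_tag uses by default for a version string.

    Hypothetical helper: the 3.1 cut-off is taken from the note above.
    """
    # Take the first two numeric components: "3.0.5" -> [3, 0], "3.1" -> [3, 1].
    numbers = [int(n) for n in re.findall(r"\d+", nltk_version)[:2]]
    if not numbers:
        raise ValueError("no version number in %r" % (nltk_version,))
    major = numbers[0]
    minor = numbers[1] if len(numbers) > 1 else 0
    return "perceptron" if (major, minor) >= (3, 1) else "maxent"
```

With NLTK installed, passing nltk.__version__ to this function tells you which of the two outputs in this answer to expect.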
    

    If you just want a TL;DR solution, see https://github.com/alvations/nltk_cli


    In long:

    Try using another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g.:

    • HunPos
    • Stanford POS
    • Senna

    Using the default MaxEnt POS tagger from NLTK (versions before 3.1), i.e. nltk.pos_tag:

    >>> from nltk import word_tokenize, pos_tag
    >>> text = "The quick brown fox jumps over the lazy dog"
    >>> pos_tag(word_tokenize(text))
    [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]
    

    Using Stanford POS tagger:

    $ cd ~
    $ wget http://nlp.stanford.edu/software/stanford-postagger-2015-04-20.zip
    $ unzip stanford-postagger-2015-04-20.zip
    $ mv stanford-postagger-2015-04-20 stanford-postagger
    $ python
    >>> from os.path import expanduser
    >>> home = expanduser("~")
    >>> # Note: newer NLTK releases name this class StanfordPOSTagger.
    >>> from nltk.tag.stanford import POSTagger
    >>> _path_to_model = home + '/stanford-postagger/models/english-bidirectional-distsim.tagger'
    >>> _path_to_jar = home + '/stanford-postagger/stanford-postagger.jar'
    >>> st = POSTagger(path_to_model=_path_to_model, path_to_jar=_path_to_jar)
    >>> text = "The quick brown fox jumps over the lazy dog"
    >>> st.tag(text.split())
    [(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'jumps', u'VBZ'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]
    

    Using HunPos (NOTE: its default encoding is ISO-8859-1, not UTF-8):

    $ cd ~
    $ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
    $ tar zxvf hunpos-1.0-linux.tgz
    $ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
    $ gzip -d en_wsj.model.gz 
    $ mv en_wsj.model hunpos-1.0-linux/
    $ python
    >>> from os.path import expanduser
    >>> home = expanduser("~")
    >>> from nltk.tag.hunpos import HunposTagger
    >>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
    >>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
    >>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
    >>> text = "The quick brown fox jumps over the lazy dog"
    >>> ht.tag(text.split())
    [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
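
Since HunPos defaults to ISO-8859-1, it can help to check your tokens before tagging. Below is a small sketch, assuming Python 3 (or unicode strings on Python 2). The function name unencodable_tokens is hypothetical and not part of NLTK or HunPos.

```python
def unencodable_tokens(tokens, encoding="ISO-8859-1"):
    """Return the tokens that cannot be represented in the given encoding."""
    bad = []
    for token in tokens:
        try:
            token.encode(encoding)
        except UnicodeEncodeError:
            # At least one character falls outside the target character set.
            bad.append(token)
    return bad


if __name__ == "__main__":
    # "\u00e9" (e with acute accent) exists in ISO-8859-1; "\u20ac" (euro sign) does not.
    print(unencodable_tokens(["caf\u00e9", "\u20ac5", "dog"]))
```

If the list comes back non-empty, either normalise those tokens first or look at whether your NLTK version lets you pass a different encoding to HunposTagger.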
    

    Using Senna (make sure you have the latest version of NLTK; some changes were made to the API):

    $ cd ~
    $ wget http://ronan.collobert.com/senna/senna-v3.0.tgz
    $ tar zxvf senna-v3.0.tgz
    $ python
    >>> from os.path import expanduser
    >>> home = expanduser("~")
    >>> from nltk.tag.senna import SennaTagger
    >>> st = SennaTagger(home+'/senna')
    >>> text = "The quick brown fox jumps over the lazy dog"
    >>> st.tag(text.split())
    [('The', u'DT'), ('quick', u'JJ'), ('brown', u'JJ'), ('fox', u'NN'), ('jumps', u'VBZ'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'NN')]
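
To compare the taggers at a glance, here is a rough sketch that scores the outputs shown in this answer against a reference tagging. The three JJ tags in the reference come from the question. The remaining reference tags, including VBZ for "jumps", are an assumption taken from the Stanford and Senna outputs, so treat the numbers as illustrative for this one sentence only.

```python
# Reference tags: the three JJ corrections come from the question; the rest
# (including VBZ for "jumps") are an assumption based on the Stanford and
# Senna outputs.
REFERENCE = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
             ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
             ("dog", "NN")]

# Tag sequences copied from the tagger outputs shown in this answer.
OUTPUTS = {
    "maxent": "DT NN NN NN NNS IN DT NN NN",
    "perceptron": "DT JJ NN NN VBZ IN DT JJ NN",
    "stanford": "DT JJ JJ NN VBZ IN DT JJ NN",
    "hunpos": "DT JJ JJ NN NNS IN DT JJ NN",
    "senna": "DT JJ JJ NN VBZ IN DT JJ NN",
}


def mismatches(predicted_tags, reference=REFERENCE):
    """Return (word, predicted, expected) for every token tagged differently."""
    return [(word, pred, gold)
            for (word, gold), pred in zip(reference, predicted_tags)
            if pred != gold]


def score(outputs=OUTPUTS, reference=REFERENCE):
    """Map each tagger name to its number of correctly tagged tokens."""
    return {name: len(reference) - len(mismatches(tags.split(), reference))
            for name, tags in outputs.items()}


if __name__ == "__main__":
    for name, correct in sorted(score().items()):
        errors = mismatches(OUTPUTS[name].split())
        print("%-10s %d/%d %s" % (name, correct, len(REFERENCE), errors))
```

One sentence is far too small a sample to rank taggers; the same loop over a larger hand-checked set would give a fairer picture.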
    

    Or try building a better POS tagger:


    Complaints about pos_tag accuracy on Stack Overflow include:

    Issues about NLTK HunPos include:

    Issues with NLTK and Stanford POS tagger include: