python nlp nltk wordnet lemmatization

WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK


I'm lemmatizing the TED dataset transcripts and noticed something strange: not all words are being lemmatized. For example,

selected -> select

Which is right.

However, involved is not lemmatized to involve, and horsing is not lemmatized to horse, unless I explicitly pass the 'v' (verb) POS argument.

In the Python terminal I get the right output, but not in my code:

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet
>>> lem = WordNetLemmatizer()
>>> lem.lemmatize('involved','v')
u'involve'
>>> lem.lemmatize('horsing','v')
u'horse'

The relevant section of the code is this:

for l in LDA_Row[0].split('+'):
    w=str(l.split('*')[1])
    word=lmtzr.lemmatize(w)
    wordv=lmtzr.lemmatize(w,'v')
    print wordv, word
    # if word is not wordv:
    #   print word, wordv

The whole code is here.

What is the problem?


Solution

  • The lemmatizer needs the correct POS tag to be accurate. If you call WordNetLemmatizer.lemmatize() without a POS argument, the tag defaults to noun; see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39

    To resolve the problem, always POS-tag your data before lemmatizing, e.g.

    >>> from nltk.stem import WordNetLemmatizer
    >>> from nltk import pos_tag, word_tokenize
    >>> wnl = WordNetLemmatizer()
    >>> sent = 'This is a foo bar sentence'
    >>> pos_tag(word_tokenize(sent))
    [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
    >>> for word, tag in pos_tag(word_tokenize(sent)):
    ...     # Penn Treebank adjective tags start with 'J'; WordNet uses 'a'.
    ...     wntag = 'a' if tag.startswith('J') else tag[0].lower()
    ...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
    ...     if not wntag:
    ...             lemma = word
    ...     else:
    ...             lemma = wnl.lemmatize(word, wntag)
    ...     print(lemma)
    ... 
    This
    be
    a
    foo
    bar
    sentence
    

    Note that 'is' is lemmatized to 'be' only when the verb tag is given, i.e.

    >>> wnl.lemmatize('is')
    'is'
    >>> wnl.lemmatize('is', 'v')
    u'be'
    

    To answer the question with words from your examples:

    >>> sent = 'These sentences involves some horsing around'
    >>> for word, tag in pos_tag(word_tokenize(sent)):
    ...     # Penn Treebank adjective tags start with 'J'; WordNet uses 'a'.
    ...     wntag = 'a' if tag.startswith('J') else tag[0].lower()
    ...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
    ...     lemma = wnl.lemmatize(word, wntag) if wntag else word
    ...     print(lemma)
    ... 
    These
    sentence
    involve
    some
    horse
    around
    

    Note that WordNetLemmatizer has some quirks.

    Also, NLTK's default POS tagger is undergoing major changes to improve accuracy.

    For an off-the-shelf lemmatizer, you can take a look at https://github.com/alvations/pywsd, where I've added some try-excepts to catch words that are not in WordNet; see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66
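
If you lemmatize in more than one place, the tag conversion from the loops above could be factored into a small helper. This is only a minimal sketch, assuming Penn Treebank tags as returned by pos_tag; the names penn_to_wordnet and lemmatize_tagged are hypothetical and not part of NLTK. Note that Penn adjective tags (JJ, JJR, JJS) start with J, so they need an explicit mapping to WordNet's 'a'. The lemmatizer is passed in as a callable so the helper does not depend on any particular library.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBZ') to a WordNet POS letter, or None.

    Hypothetical helper, not an NLTK function.
    """
    if tag.startswith('J'):
        return 'a'  # JJ, JJR, JJS are adjectives; WordNet calls them 'a'
    first = tag[:1].lower()
    # Nouns (N*), verbs (V*) and adverbs (R*) share their first letter
    # with the WordNet POS letters; everything else has no WordNet POS.
    return first if first in ('n', 'v', 'r') else None


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs with any callable lemmatize(word, pos).

    Words whose tag has no WordNet equivalent are returned unchanged.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        wntag = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, wntag) if wntag else word)
    return lemmas
```

With NLTK, this would be called as lemmatize_tagged(pos_tag(word_tokenize(sent)), wnl.lemmatize), using the same wnl and sent as in the examples above.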