WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK

I'm lemmatizing the TED dataset transcripts and noticed something strange: not all words are being lemmatized. For example,

selected -> select

Which is right.

However, involved !-> involve and horsing !-> horse unless I explicitly pass 'v' (verb) as the POS argument.

In the Python terminal I get the right output, but not in my code:

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet
>>> lem = WordNetLemmatizer()
>>> lem.lemmatize('involved','v')
u'involve'
>>> lem.lemmatize('horsing','v')
u'horse'

The relevant section of the code is this:

for l in LDA_Row[0].split('+'):
    w=str(l.split('*')[1])
    word=lmtzr.lemmatize(w)
    wordv=lmtzr.lemmatize(w,'v')
    print wordv, word
    # if word != wordv:
    #     print word, wordv

The whole code is here.

What is the problem?

Solution

The lemmatizer needs the correct POS tag to be accurate. If you call WordNetLemmatizer.lemmatize() without one, the tag defaults to noun, so verb forms such as 'involved' and 'horsing' are looked up as nouns and come back unchanged; see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39

To resolve the problem, always POS-tag your data before lemmatizing, e.g.

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = 'a' if tag.startswith('J') else tag[0].lower()  # Penn JJ* -> WordNet 'a'
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     if not wntag:
...         lemma = word
...     else:
...         lemma = wnl.lemmatize(word, wntag)
...     print(lemma)
... 
This
be
a
foo
bar
sentence

Note that 'is' becomes 'be' only when it is tagged as a verb, i.e.

>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
u'be'

To answer the question with words from your examples:

>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = 'a' if tag.startswith('J') else tag[0].lower()  # Penn JJ* -> WordNet 'a'
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     lemma = wnl.lemmatize(word, wntag) if wntag else word
...     print(lemma)
... 
These
sentence
involve
some
horse
around
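
The tag conversion used in the sessions above can be factored into a small helper. The following is only a sketch: the names penn_to_wordnet and lemmatize_tagged are made up for illustration and are not part of NLTK. One detail worth making explicit is that Penn Treebank adjective tags start with 'J' (JJ, JJR, JJS) while WordNet uses 'a', so lowercasing the first letter of the tag alone never yields 'a'. The lemmatizer is passed in as a callable so the helper itself does not depend on NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if none applies."""
    if tag.startswith('J'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('V'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('N'):
        return 'n'  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS (not RP, which is a particle)
    return None


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, penn_tag) pairs with a lemmatize(word, pos) callable.

    Words whose tag has no WordNet counterpart are returned unchanged.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        wntag = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, wntag) if wntag else word)
    return lemmas
```

With NLTK, this would presumably be called as lemmatize_tagged(pos_tag(word_tokenize(sent)), wnl.lemmatize), reusing the wnl object from the sessions above.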

Note that WordNetLemmatizer has some quirks.

Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy.

For an out-of-the-box / off-the-shelf lemmatization solution, take a look at https://github.com/alvations/pywsd, where I've added some try-excepts to catch words that are not in WordNet; see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66