
Why Doesn't NLTK's WordNet Lemmatizer Lemmatize Adverbs and Adjectives?


As I understand it, lemmatization works better if we first identify the PoS tag for each token and then pass it to the lemmatizer as an argument, so that it handles not only verbs and nouns but also adjectives and adverbs.

So I wrote the code below, which covers all four types, hoping to get the root forms of 'absolutely' and 'lovely'. However, I still get the same words back.

Three questions here:

  1. Is there a way to address this issue while still using the same library?
  2. Is there another library or function that does better lemmatization?
  3. Is this a limitation of NLTK's WordNet lemmatizer, i.e. that it cannot lemmatize every type of word perfectly?


Thanks in advance.

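import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Resources needed by the tagger and the lemmatizer
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

example = ['absolutely', 'lovely']
print(nltk.pos_tag(example))

def get_pos_tags(word):
    # Tag the word on its own and keep the first letter of its Penn Treebank tag
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,   # adjective
                "N": wordnet.NOUN,  # noun
                "V": wordnet.VERB,  # verb
                "R": wordnet.ADV}   # adverb

    # Fall back to noun for any other tag
    return tag_dict.get(tag, wordnet.NOUN)


def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, get_pos_tags(w)) for w in text]

final_output = lemmatize_text(example)
print(final_output)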


Solution

  • For the words lovely and absolutely, the lemma is the same as the word itself, so the output you get is correct. Here are a few closely related words you can try in NLTK.

    word:pos       -> lemma
    -------------------------
    absolute:adj   -> absolute
    absolutely:adv -> absolutely
    lovely:adj     -> lovely
    lovelier:adj   -> lovely
    loveliest:adj  -> lovely
    

    Be aware that to get the correct lemma you need the correct part-of-speech (POS) tag, and to get the correct POS tag you need to tag a well-formed sentence with the word in it, so the tagger has the context. Without this, you will often get the wrong POS tag for the word.
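
    One way to give the tagger that context, sketched here with NLTK: tag the whole tokenized sentence first, then lemmatize each token with its own tag. The helper names `penn_to_wordnet` and `lemmatize_in_context` are made up for this illustration, and the sketch assumes the tagger and WordNet resources are already downloaded (their exact resource names vary between NLTK versions).

```python
# WordNet's single-letter POS codes; these are the values of wordnet.ADJ,
# wordnet.NOUN, wordnet.VERB and wordnet.ADV in nltk.corpus.wordnet.
PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'JJR', 'VBD') to a WordNet POS code."""
    # Noun is the fallback because it is also WordNetLemmatizer's default.
    return PENN_TO_WORDNET.get(penn_tag[:1].upper(), "n")


def lemmatize_in_context(tokens):
    """Tag a whole tokenized sentence, then lemmatize each token with its tag."""
    # Imported here so the mapping above can be used without NLTK installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)  # the tagger sees each token's neighbours
    return [lemmatizer.lemmatize(word, penn_to_wordnet(tag))
            for word, tag in tagged]
```

    For example, `lemmatize_in_context("She wore the loveliest dress".split())` tags `loveliest` using the surrounding words instead of tagging it in isolation.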

    In general, NLTK is fairly poor at POS tagging and lemmatization. It's an old library that is rule-based and doesn't use more modern techniques. I would generally not recommend using NLTK.

    spaCy is probably the most popular NLP system, and it does POS tagging and lemmatization (among other things) in the same step. Unfortunately, spaCy's lemmatizer uses the same basic design as NLTK's, and while its performance is better, it's still not the best.
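
    As a rough sketch of what "the same step" looks like with spaCy: one call to the pipeline yields tokens that already carry both a POS tag and a lemma. The function name `tag_and_lemmatize` is made up for this example, and the model name in the commented usage is an assumption about what you have installed.

```python
def tag_and_lemmatize(nlp, text):
    """Return (token text, coarse POS tag, lemma) triples from one pipeline pass.

    `nlp` is expected to be a loaded spaCy pipeline; any callable that returns
    tokens exposing .text, .pos_ and .lemma_ will also work.
    """
    doc = nlp(text)
    return [(tok.text, tok.pos_, tok.lemma_) for tok in doc]


# Typical usage (assumes spaCy and the small English model are installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   tag_and_lemmatize(nlp, "She wore the loveliest dress.")
```

    Passing the pipeline in as an argument means the model is loaded once and reused across calls.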

    LemmInflect gives the best overall performance, but it's only a lemma/inflection lookup. It doesn't include a POS tagger, so you still need to get the tag from somewhere. LemmInflect also acts as a plug-in for spaCy, and using the two together will give you the best performance. LemmInflect's homepage shows how to do this, along with some stats on its performance compared to NLTK and spaCy.
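
    If the tag comes from a Penn Treebank tagger such as NLTK's, a minimal sketch of feeding it to LemmInflect on its own could look like the following. It assumes LemmInflect's `getLemma(word, upos=...)` function, which takes a Universal POS name and returns a tuple of candidate lemmas. The helpers `penn_to_upos` and `lemminflect_lemmas` are hypothetical names, and the mapping is deliberately simplified (for instance, proper nouns are treated as plain nouns).

```python
# Universal POS names for the four open word classes, keyed by the first
# letter of a Penn Treebank tag. Simplified on purpose for this sketch.
PENN_TO_UPOS = {"J": "ADJ", "N": "NOUN", "V": "VERB", "R": "ADV"}


def penn_to_upos(penn_tag):
    """Map a Penn Treebank tag (e.g. 'JJS', 'VBZ') to a Universal POS name."""
    return PENN_TO_UPOS.get(penn_tag[:1].upper(), "NOUN")


def lemminflect_lemmas(word, penn_tag):
    """Look up candidate lemmas for `word`, given a tag from an external tagger."""
    # Imported lazily so the mapping can be used without LemmInflect installed.
    from lemminflect import getLemma

    return getLemma(word, upos=penn_to_upos(penn_tag))
```

    The tag passed in should still come from tagging the full sentence, since the lookup itself has no context to work from.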

    However, remember that you won't get the right lemmas without the right POS tag, and for spaCy, or any tagger, to get that right, the word needs to be in a full sentence.