Lemmatization of words using spacy and nltk not giving correct lemma


I want to get the lemmas of the words in the list below, for example:

words = ['Funnier','Funniest','mightiest','tighter']

When I use spaCy:

import spacy
nlp = spacy.load('en')
words = ['Funnier','Funniest','mightiest','tighter','biggify']
doc = spacy.tokens.Doc(nlp.vocab, words=words)
for items in doc:
    print(items.lemma_)

I got these lemmas:

Funnier
Funniest
mighty
tight 

When I use NLTK's WordNetLemmatizer:

from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' +  lemmatizer.lemmatize(token))

I got:

Funnier --> Funnier
Funniest --> Funniest
mightiest --> mightiest
tighter --> tighter

Can anyone help with this?

Thanks.


Solution

  • Lemmatisation depends entirely on the part-of-speech (POS) tag you use when looking up the lemma of a particular word.

    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    # Define the sentence to be lemmatized
    sentence = "The striped bats are hanging on their feet for best"

    # Tokenize: split the sentence into words
    word_list = nltk.word_tokenize(sentence)
    print(word_list)
    #> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']

    # Lemmatize each word (the default POS is noun) and join the results
    lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    print(lemmatized_output)
    #> The striped bat are hanging on their foot for best
    

    The code above is a simple example of how to use the WordNet lemmatizer on words and sentences.

    Notice that it didn’t do a good job: ‘are’ is not converted to ‘be’ and ‘hanging’ is not converted to ‘hang’ as expected. This can be corrected by passing the correct part-of-speech (POS) tag as the second argument to lemmatize(), as the sketch below shows.
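
    One way to supply those POS tags automatically is to tag the tokens with nltk.pos_tag and map the Penn Treebank tags it returns onto WordNet's POS constants. Here is a minimal sketch, assuming the averaged_perceptron_tagger resource has been downloaded; the penn_to_wordnet helper is illustrative, not an NLTK function, and the output shown is what I would expect:

    import nltk
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    # Illustrative helper: map a Penn Treebank tag to a WordNet POS constant
    def penn_to_wordnet(tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        if tag.startswith('V'):
            return wordnet.VERB
        if tag.startswith('R'):
            return wordnet.ADV
        return wordnet.NOUN  # lemmatize() defaults to noun anyway

    lemmatizer = WordNetLemmatizer()
    sentence = "The striped bats are hanging on their feet for best"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    print(' '.join(lemmatizer.lemmatize(w, penn_to_wordnet(t)) for w, t in tagged))
    #> The striped bat be hang on their foot for best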

    Sometimes the same word can have multiple lemmas, depending on its meaning or context.

    print(lemmatizer.lemmatize("stripes", 'v'))  
    #> strip
    
    print(lemmatizer.lemmatize("stripes", 'n'))  
    #> stripe
    

    For the original example, specify the corresponding POS tag:

    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    words = ['Funnier','Funniest','mightiest','tighter','biggify']
    for token in words:
        # wordnet.ADJ_SAT is WordNet's "satellite adjective" tag ('s')
        print(token + ' --> ' + lemmatizer.lemmatize(token, wordnet.ADJ_SAT))
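
    Note that WordNet lookups are case-sensitive, so the capitalised ‘Funnier’ and ‘Funniest’ may still come back unchanged. A variation that lowercases each token before the lookup should give the expected lemmas; the output below is my expectation (‘biggify’ is not in WordNet, so it passes through unchanged):

    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    words = ['Funnier','Funniest','mightiest','tighter','biggify']
    for token in words:
        # Lowercase first: WordNet's index and exception lists are lowercase
        print(token + ' --> ' + lemmatizer.lemmatize(token.lower(), wordnet.ADJ))
    #> Funnier --> funny
    #> Funniest --> funny
    #> mightiest --> mighty
    #> tighter --> tight
    #> biggify --> biggify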