I want to get the lemmatized forms of the words in a list, e.g.:
words = ['Funnier','Funniest','mightiest','tighter']
When I use spaCy:
import spacy
nlp = spacy.load('en')
words = ['Funnier','Funniest','mightiest','tighter','biggify']
doc = spacy.tokens.Doc(nlp.vocab, words=words)
for items in doc:
    print(items.lemma_)
I got lemmas like:
Funnier
Funniest
mighty
tight
When I use NLTK's WordNetLemmatizer:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token))
I got:
Funnier --> Funnier
Funniest --> Funniest
mightiest --> mightiest
tighter --> tighter
Can anyone help with this?
Thanks.
Lemmatisation depends on the part-of-speech (POS) tag you pass when looking up the lemma of a word.
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"
# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best
The above code is a simple example of how to use the WordNet lemmatizer on words and sentences.
Notice it didn't do a good job: 'are' is not converted to 'be' and 'hanging' is not converted to 'hang' as expected. This can be corrected by providing the correct part-of-speech (POS) tag as the second argument to lemmatize().
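For example, passing the verb tag 'v' as the second argument fixes both of these cases:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# With pos='v', WordNet resolves the inflected verb forms
print(lemmatizer.lemmatize("are", 'v'))
#> be
print(lemmatizer.lemmatize("hanging", 'v'))
#> hang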
Sometimes the same word can have multiple lemmas, depending on the meaning/context.
print(lemmatizer.lemmatize("stripes", 'v'))
#> strip
print(lemmatizer.lemmatize("stripes", 'n'))
#> stripe
For the example in the question, specify the corresponding POS tag:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token, wordnet.ADJ_SAT))
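If you don't want to hard-code the tag, a common pattern (a sketch, not part of the original answer) is to derive it from nltk.pos_tag and map the Penn Treebank tag to a WordNet constant. Note that the WordNet index is lowercase, so lowercasing the tokens also helps with capitalised words like 'Funnier':

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    # Map the first letter of the Penn Treebank tag to a WordNet POS constant
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {'J': wordnet.ADJ, 'N': wordnet.NOUN,
                'V': wordnet.VERB, 'R': wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)  # fall back to noun

lemmatizer = WordNetLemmatizer()
sentence = "The striped bats are hanging on their feet for best"
for w in nltk.word_tokenize(sentence):
    print(w + ' --> ' + lemmatizer.lemmatize(w.lower(), get_wordnet_pos(w)))
# 'are' and 'hanging' should now come out as 'be' and 'hang'

Tagging isolated words is less reliable than tagging a full sentence, so for a bare word list like the one in the question, passing the adjective tag explicitly (as above) is often the simpler option.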