I ran the following:
from nltk import WordNetLemmatizer as wnl
wnl().lemmatize("American")
wnl().lemmatize("Americans")
Both calls simply return their argument unchanged. I would like "Americans" to reduce to "American". Does anybody have any idea how to make this happen?
I assumed I'd have to modify whatever internal dictionary the lemmatizer uses. Is this correct? Does anybody know a better way?
Thanks!
You can convert the word to lower case before giving it to the lemmatizer, and restore the case afterwards.
I have used this code in the past:
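from nltk import WordNetLemmatizer as wnl

word = 'Americans'
# Lemmatize the lowercased form, then restore the original capitalization
lemmatized = wnl().lemmatize(word.lower())
if word.istitle():
    word = lemmatized.capitalize()
else:
    word = lemmatized
# word == 'American'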
This assumes that no word contains multiple uppercase letters (like "MySpace"), which was true in my case at the time. I think this generally holds, since words with multiple uppercase letters tend to be proper nouns, so there is usually no need to lemmatize them.
If you're also concerned about all-UPPERCASE words, you can handle that case as well:
word = 'AMERICANS'
lemmatized = wnl().lemmatize(word.lower())
if word.istitle():
    word = lemmatized.capitalize()
elif word.upper() == word:
    word = lemmatized.upper()
else:
    word = lemmatized
# word == 'AMERICAN'
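If you need this in more than one place, the two snippets above can be folded into a single helper. This is only a sketch: the function name `lemmatize_preserving_case` and its `lemmatize` parameter are my own invention, not part of NLTK. It takes the lemmatizing function as an argument so the case handling stays independent of which lemmatizer you use.

```python
def lemmatize_preserving_case(word, lemmatize):
    """Lemmatize `word` in lower case, then restore its original casing.

    `lemmatize` is any callable mapping a lowercase string to its lemma
    (a hypothetical parameter for this sketch, not an NLTK API).
    Mixed-case words such as "MySpace" come back fully lowercased,
    which matches the assumption stated above.
    """
    lemmatized = lemmatize(word.lower())
    if word.istitle():
        # "Americans" -> capitalize the lemma
        return lemmatized.capitalize()
    if word.isupper():
        # "AMERICANS" -> uppercase the lemma
        return lemmatized.upper()
    # Already lowercase, or mixed case we do not try to restore
    return lemmatized
```

With NLTK you would call it as `lemmatize_preserving_case(word, wnl().lemmatize)`, reusing the `wnl` import from the question.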