If you feed the word "US" (United States) into the WordNetLemmatizer from the package nltk.stem after preprocessing (i.e. lower-casing it to "us"), it is translated to "u". For example:
from nltk.stem import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
word = "US".lower() # "US" becomes "us"
lemma = lmtzr.lemmatize(word)
print(lemma) # prints "u"
I have even tried lemmatizing the word with a POS tag. The pos_tag() function from the nltk package tags the word as 'NNP' (NN = noun, P = proper, i.e. a proper noun). But 'NNP' maps to wordnet.NOUN, which is the default POS the lemmatizer uses anyway. Therefore lmtzr.lemmatize(word) and lmtzr.lemmatize(word, wordnet.NOUN) are the same (where wordnet is imported from the package nltk.stem.wordnet).
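For reference, the usual way to map a Penn Treebank tag such as 'NNP' onto a WordNet POS is by the tag's first letter. This hypothetical helper (not part of nltk) illustrates why 'NNP' ends up as the noun POS; the WordNet POS constants are the one-letter strings 'n', 'v', 'a', 'r':

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet POS letter; anything
    # unrecognized falls back to noun, like the lemmatizer's default.
    first = tag[0].upper()
    return {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}.get(first, 'n')

print(penn_to_wordnet('NNP'))  # 'n' -- same value as wordnet.NOUN, the default
print(penn_to_wordnet('VBD'))  # 'v'
```

So passing the pos_tag() result through a mapping like this still hands the lemmatizer a noun, which is why it changes nothing here.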
Any ideas on how to tackle this problem, apart from the clumsy workaround of explicitly excluding the word "us" from lemmatization with an if statement?
If you look at the source code of WordNetLemmatizer:
def lemmatize(self, word, pos=NOUN):
lemmas = wordnet._morphy(word, pos)
return min(lemmas, key=len) if lemmas else word
wordnet._morphy("us", "n") returns ['us', 'u'], and min(lemmas, key=len) returns the shortest candidate, which is 'u'.
wordnet._morphy uses a suffix rule for nouns that replaces the ending "s" with "" (the empty string).
Here is the full list of substitutions:
[('s', ''),
('ses', 's'),
('ves', 'f'),
('xes', 'x'),
('zes', 'z'),
('ches', 'ch'),
('shes', 'sh'),
('men', 'man'),
('ies', 'y')]
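The noun step can be sketched with these substitutions alone: apply every suffix rule whose ending matches, collect the candidates, and (as in lemmatize above) pick the shortest with min(key=len). This is a simplified sketch that ignores the exception list and the check that each candidate actually exists in WordNet:

```python
SUBSTITUTIONS = [
    ('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'),
    ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y'),
]

def noun_candidates(word):
    # Apply each suffix rule whose ending matches the word.
    return [word[:-len(old)] + new
            for old, new in SUBSTITUTIONS
            if word.endswith(old)]

print(noun_candidates('us'))      # ['u'] -- only the ('s', '') rule fires
print(min(['us', 'u'], key=len))  # 'u'  -- why the lemmatizer returns "u"
```

For "us" the only matching rule is ('s', ''), producing 'u'; since 'u' is shorter than 'us', it wins.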
I don't see a very clean way out.
1) You may write a special rule for excluding all-upper-case words.
2) Or you may add the line us us
to the file nltk_data/corpora/wordnet/noun.exc (the noun exception list, where each line maps an inflected form to its lemma)
3) You may write your own function that selects the longest candidate instead (which might give wrong results for other words):
from nltk.corpus.reader.wordnet import NOUN
from nltk.corpus import wordnet
def lemmatize(word, pos=NOUN):
lemmas = wordnet._morphy(word, pos)
return max(lemmas, key=len) if lemmas else word
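Option 1 could look like the sketch below. The helper name and the injected lemmatize callable are hypothetical; in practice you would pass in WordNetLemmatizer().lemmatize:

```python
def lemmatize_token(token, lemmatize):
    # Option 1: skip acronyms/abbreviations, i.e. tokens that are all
    # upper-case before preprocessing, and lemmatize everything else.
    # `lemmatize` is any callable such as WordNetLemmatizer().lemmatize.
    if token.isupper():
        return token.lower()  # keep "US" -> "us" untouched
    return lemmatize(token.lower())

# Usage with a stub in place of the real lemmatizer:
stub = lambda w: {'churches': 'church'}.get(w, w)
print(lemmatize_token("US", stub))        # us
print(lemmatize_token("churches", stub))  # church
```

The cost is that legitimately plural all-caps tokens (e.g. acronym plurals) are never lemmatized, which may or may not matter for your corpus.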