
NLTK WordNetLemmatizer processes "US" as "u"


If you pass the word "US" (United States) to the WordNetLemmatizer from nltk.stem after preprocessing (which lowercases it to "us"), it is lemmatized to "u". For example:

from nltk.stem import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
word = "US".lower()  #  "US" becomes "us"
lemma = lmtzr.lemmatize(word)
print(lemma)  # prints "u"

I have also tried lemmatizing the word together with its POS tag. The pos_tag() function from nltk tags it as 'NNP' (NN = noun, P = proper, i.e. a proper noun). But 'NNP' maps to wordnet.NOUN, which is already the default part of speech the lemmatizer uses. Therefore, lmtzr.lemmatize(word) and lmtzr.lemmatize(word, wordnet.NOUN) give the same result (where wordnet is imported from nltk.stem.wordnet).

Any ideas on how to tackle this problem, other than the clumsy approach of explicitly skipping the word "us" with an if statement before it reaches the lemmatizer?


Solution

  • If you look at the source code of WordNetLemmatizer

    def lemmatize(self, word, pos=NOUN):
        lemmas = wordnet._morphy(word, pos)
        return min(lemmas, key=len) if lemmas else word
    

    For "us", wordnet._morphy returns ['us', 'u'].

    min(lemmas, key=len) then returns the shortest candidate, which is "u".

    wordnet._morphy uses a rule for nouns that replaces a trailing "s" with "".

    Here is the list of substitutions:

    [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')]
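To make the mechanism concrete, here is a simplified sketch that does not use NLTK. It applies the suffix rules above once, keeps only the forms found in a vocabulary, and then picks the shortest one. TOY_VOCABULARY, candidate_lemmas and shortest_lemma are hypothetical names for illustration only. The toy set stands in for the WordNet noun index, and the real _morphy does more than this.

```python
# Noun suffix rules quoted above, as (old_suffix, new_suffix) pairs.
NOUN_SUBSTITUTIONS = [
    ("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"), ("zes", "z"),
    ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y"),
]

# Assumption: a tiny stand-in for the WordNet noun index.
TOY_VOCABULARY = {"us", "u", "dog", "church", "bus"}


def candidate_lemmas(word, vocabulary=TOY_VOCABULARY):
    """Return the word plus its suffix-substituted forms that are in vocabulary."""
    forms = [word]
    for old, new in NOUN_SUBSTITUTIONS:
        if word.endswith(old):
            forms.append(word[: len(word) - len(old)] + new)
    # Keep known forms only, preserving order and dropping duplicates.
    known = []
    for form in forms:
        if form in vocabulary and form not in known:
            known.append(form)
    return known


def shortest_lemma(word, vocabulary=TOY_VOCABULARY):
    """Mimic the min(lemmas, key=len) selection shown in the source above."""
    lemmas = candidate_lemmas(word, vocabulary)
    return min(lemmas, key=len) if lemmas else word


print(candidate_lemmas("us"))  # ['us', 'u']
print(shortest_lemma("us"))    # u
print(shortest_lemma("dogs"))  # dog
```

Because both "us" and "u" count as known forms, choosing the shortest candidate is what turns "us" into "u".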

    I don't see a very clean way out.

    1) You may write a special rule that skips all-upper-case words (the check has to run before the text is lowercased).

    2) Or you may add the line "us us" to the file nltk_data/corpora/wordnet/noun.exc, the noun exception list.

    3) You may write your own function that selects the longest word instead (which might be wrong for other words):

    from nltk.corpus.reader.wordnet import NOUN
    from nltk.corpus import wordnet
    def lemmatize(word, pos=NOUN):
        lemmas = wordnet._morphy(word, pos)
        return max(lemmas, key=len) if lemmas else word
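
Option 1 could be sketched roughly as follows. lemmatize_token and its keep parameter are hypothetical names, not part of NLTK. The lemmatizer is passed in as a plain callable, so any function that maps a word to its lemma will do, for example WordNetLemmatizer().lemmatize. The sketch assumes the all-caps check sees the original token, before lowercasing.

```python
def lemmatize_token(token, lemmatize, keep=frozenset()):
    """Lowercase and lemmatize a token, leaving acronyms and kept words alone.

    token     -- the original token, before lowercasing
    lemmatize -- any callable mapping a lowercase word to its lemma
    keep      -- lowercase words that must never be lemmatized
    """
    lowered = token.lower()
    # All-caps tokens longer than one character are treated as acronyms.
    if token.isupper() and len(token) > 1:
        return lowered
    if lowered in keep:
        return lowered
    return lemmatize(lowered)


# Stub lemmatizer for demonstration only: it just strips a trailing "s".
def strip_s(word):
    return word[:-1] if word.endswith("s") else word


print(lemmatize_token("US", strip_s))                # us
print(lemmatize_token("dogs", strip_s))              # dog
print(lemmatize_token("us", strip_s, keep={"us"}))   # us
```

The all-caps guard only helps while the original casing is still available. The keep set is a code-level variant of the if statement the question wanted to avoid, and it also covers the lowercase pronoun "us".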