nltk lemmatizer doesn't know what to do with the word Americans


I ran the following:

from nltk import WordNetLemmatizer as wnl
wnl().lemmatize("American")
wnl().lemmatize("Americans")

Both calls simply return their argument. I would like "Americans" to reduce to "American". Does anybody have any idea how to make this happen?

I assumed I'd have to modify whatever internal dictionary the lemmatizer uses. Is this correct, or does anybody know a better way?

Thanks!


Solution

  • You can convert the word to lower case before giving it to the lemmatizer, and restore the case afterwards.

    I have used this code in the past:

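    from nltk import WordNetLemmatizer as wnl

    word = 'Americans'
    # Lemmatize the lower-cased form of the word
    lemmatized = wnl().lemmatize(word.lower())
    # Restore the leading capital if the original word was title-cased
    if word.istitle():
        word = lemmatized.capitalize()
    else:
        word = lemmatized
    # word is now 'American'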
    

    This assumes that no word contains capitals in the middle (like "MySpace"), which was true in my case at the time. I think it is generally a safe assumption, since words with several uppercase letters tend to be proper nouns, and there is usually no need to lemmatize those.

    If you also need to handle words written entirely in UPPERCASE, you can add that case as well:

    word = 'AMERICANS'
    lemmatized = wnl().lemmatize(word.lower())
    if word.istitle():
        # Title case: restore the leading capital
        word = lemmatized.capitalize()
    elif word.isupper():
        # All caps: restore the upper case
        word = lemmatized.upper()
    else:
        word = lemmatized
    # word is now 'AMERICAN'
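
The two snippets above can be folded into one reusable helper. This is only a minimal sketch: the function name lemmatize_preserving_case and its lemmatize parameter are hypothetical, not part of NLTK. The lemmatizer is passed in as a callable so the case handling can be tried without the WordNet data installed. The toy lemmatizer in the demo is a stand-in for illustration only.

```python
from typing import Callable


def lemmatize_preserving_case(word: str, lemmatize: Callable[[str], str]) -> str:
    """Lemmatize the lower-cased word, then restore its original casing.

    `lemmatize` is any function mapping a word to its lemma (hypothetical
    parameter; with NLTK it could be WordNetLemmatizer().lemmatize).
    """
    lemma = lemmatize(word.lower())
    if word.istitle():
        # 'Americans' -> 'American'
        return lemma.capitalize()
    if word.isupper():
        # 'AMERICANS' -> 'AMERICAN'
        return lemma.upper()
    # Lower-case and mixed-case words (e.g. 'MySpace') come back lower-cased
    return lemma


if __name__ == "__main__":
    # Toy stand-in for a real lemmatizer: strips one trailing "s".
    toy = lambda w: w[:-1] if w.endswith("s") else w
    print(lemmatize_preserving_case("Americans", toy))
    print(lemmatize_preserving_case("AMERICANS", toy))
```

With NLTK and the WordNet corpus available, the call would presumably look like lemmatize_preserving_case("Americans", wnl().lemmatize). Note that mixed-case words such as "MySpace" still lose their capitals, which is the same limitation described above.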