
Is the NLTK WordNet lemmatizer language-independent?


Is it true that NLTK's WordNet lemmatizer does not depend on the language of the input text? Would I use the same sequence of commands:

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> print(wnl.lemmatize('dogs'))
dog
>>> print(wnl.lemmatize('churches'))
church
>>> print(wnl.lemmatize('aardwolves'))
aardwolf
>>> print(wnl.lemmatize('abaci'))
abacus
>>> print(wnl.lemmatize('hardrock'))
hardrock

for both English and French, for instance?


Solution

    In Short

    No, the WordNet lemmatizer in NLTK works only for English.

    In Long

    If we look at the source at https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L15:

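    from nltk.corpus import wordnet
    from nltk.corpus.reader.wordnet import NOUN

    class WordNetLemmatizer(object):

        def __init__(self):
            pass

        def lemmatize(self, word, pos=NOUN):
            # _morphy() returns the candidate base forms it finds for this word
            lemmas = wordnet._morphy(word, pos)
            # Pick the shortest candidate; fall back to the input word if none
            return min(lemmas, key=len) if lemmas else word

        def __repr__(self):
            return '<WordNetLemmatizer>'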
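To make the fallback in `lemmatize()` concrete, here is a minimal sketch of just the selection step, with no NLTK dependency. The helper name `pick_lemma` and the candidate lists passed to it are hypothetical and only illustrate the `min(..., key=len) if lemmas else word` expression quoted above.

```python
def pick_lemma(word, lemmas):
    """Mirror the selection step: shortest candidate, else the word itself."""
    return min(lemmas, key=len) if lemmas else word


# A word with a known base form: the candidate is returned.
print(pick_lemma('dogs', ['dog']))

# No candidates (unknown English word, or a word from another language):
# the input comes back unchanged, which can look like a successful lemmatization.
print(pick_lemma('hardrock', []))
print(pick_lemma('chevaux', []))
```

This is why passing French text does not raise an error: an empty candidate list simply returns the original token.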
    

    It is based on the _morphy() function at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1764, which applies several English-specific suffix substitutions:

    MORPHOLOGICAL_SUBSTITUTIONS = {
        NOUN: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
               ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
               ('men', 'man'), ('ies', 'y')],
        VERB: [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''),
               ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')],
        ADJ: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
        ADV: []}

    MORPHOLOGICAL_SUBSTITUTIONS[ADJ_SAT] = MORPHOLOGICAL_SUBSTITUTIONS[ADJ]
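
As a rough illustration of why these rules are English-specific, the sketch below applies the NOUN substitutions to a word and lists the resulting candidate forms. It is a simplified stand-in, not the actual `_morphy()` code: the names `NOUN_RULES` and `candidate_forms` are hypothetical, and the real function additionally consults exception lists and keeps only candidates that exist in WordNet.

```python
# NOUN rules copied from the MORPHOLOGICAL_SUBSTITUTIONS table above.
NOUN_RULES = [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
              ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
              ('men', 'man'), ('ies', 'y')]


def candidate_forms(word, rules=NOUN_RULES):
    """Return every form obtained by swapping a matching suffix (simplified)."""
    return [word[:len(word) - len(old)] + new
            for old, new in rules
            if word.endswith(old)]


# English plurals match one or more suffix rules.
print(candidate_forms('churches'))    # ['churche', 'church']
print(candidate_forms('aardwolves'))  # ['aardwolve', 'aardwolf']

# French plurals such as 'chevaux' (plural of 'cheval') match no rule at all.
print(candidate_forms('chevaux'))     # []
```

A French plural ending in -aux produces no candidates, so the lemmatizer has nothing to look up and returns the token as-is.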