python-2.7 nltk lemmatization

NLTK lemmatization wrong result


I've used NLTK and got a wrong result like this:

>>> print lmtzr.lemmatize('coding', 'v')
cod

I expected the answer to be "code", not a fish. Is there any way to fix this, or is there another Python library that does a better job?


Solution

  • One way to fix this is to add the word 'coding' to wordnet._exception_map:

    import nltk.stem as stem
    import nltk.corpus as corpus
    wordnet = corpus.wordnet
    wordnet._exception_map['v']['coding'] = ['code']
    wnl = stem.WordNetLemmatizer()   
    
    print(wnl.lemmatize('coding', 'v'))
    # code
    

    Note that attributes which start with a single underscore are considered private -- i.e. they are not part of the public interface. So modifying wordnet._exception_map as above is not guaranteed to work in future versions of nltk. (The above works with NLTK version 3.0.0. It was found by looking at the source code for WordNetLemmatizer.lemmatize and wordnet._morphy.)

    Another way to fix the problem is to modify nltk_data/corpora/wordnet/verb.exc. The contents of the file look like:

    cockneyfied cockneyfy
    codded cod
    codding cod
    codified codify
    cogged cog
    cogging cog
    

    If you add

    coding code
    

    then this exception is added to wordnet._exception_map automatically for you.

    The third option, less hacky than the previous two, is to convince the developers of WordNet to add the entry coding code to nltk_data/corpora/wordnet/verb.exc.
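
To make the role of the exception entries more concrete, here is a minimal sketch of the idea behind the first two options: lines in the verb.exc layout (an inflected form followed by its base form or forms) are read into a dictionary, and that dictionary is checked before anything else. This is not NLTK's code. The names parse_exceptions, lemmatize_with_exceptions, SAMPLE and strip_ing are hypothetical, and strip_ing is only a crude stand-in for the real suffix rules.

```python
def parse_exceptions(text):
    """Parse lines laid out like verb.exc: inflected form, then base form(s)."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            table[parts[0]] = parts[1:]
    return table


def lemmatize_with_exceptions(word, exceptions, fallback):
    """Return the first listed base form for an exception, else defer to fallback."""
    bases = exceptions.get(word)
    if bases:
        return bases[0]
    return fallback(word)


def strip_ing(word):
    # Hypothetical stand-in for suffix rules; NOT what NLTK actually does.
    return word[:-3] if word.endswith("ing") else word


# Sample entries in the verb.exc layout, including the added "coding code" line.
SAMPLE = """\
codding cod
codified codify
coding code
"""

exceptions = parse_exceptions(SAMPLE)

print(strip_ing("coding"))                                          # cod
print(lemmatize_with_exceptions("coding", exceptions, strip_ing))   # code
print(lemmatize_with_exceptions("walking", exceptions, strip_ing))  # walk
```

With NLTK installed, the fallback could instead be something like lambda w: wnl.lemmatize(w, 'v'), which keeps the override table in your own code and leaves both the private attribute and the data file untouched.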