python nlp wordnet linguistics lemmatization

How to modify the WordNet lemmatizer to lemmatize specific words?


I am applying the WordNet lemmatizer to my corpus, and I need to specify the POS tag for the lemmatizer:

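import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()

def lemmitize(document):
    # Lemmatize the token as a verb, then stem the resulting lemma.
    return stemmer.stem(WordNetLemmatizer().lemmatize(document, pos='v'))

def preprocess(document):
    output = []
    for token in gensim.utils.simple_preprocess(document):
        # Skip stopwords and tokens of three characters or fewer.
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            print("lemmitize: ", lemmitize(token))
            output.append(lemmitize(token))
    return output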

As you can see, I am setting pos to verb (I know the WordNet default pos is noun). However, when I lemmatize this sentence:

the left door closed at the night  

I get this output:

output:  ['leav', 'door', 'close', 'night']

This is not what I was expecting. In the sentence above, left says which door is meant (left as opposed to right). Choosing pos='n' might solve this problem, but then the lemmatizer behaves like the WordNet default and has no effect on words like taken.

I found a similar issue here and modified the exception list in nltk_data/corpora/wordnet/verb.exc, changing left leave to left left, but I still get the same result, leav.
Is there any solution to this problem? Ideally, is there a way to add a custom dictionary of words (limited to my document) that WordNet should not lemmatize, like:

my_dict_list = [left, ...]

Solution

  • You can add a custom dictionary that maps certain words to POS tags, such as pos_dict = {'breakfasted':'v', 'left':'a', 'taken':'v'}.

    By passing this pos_dict along with the token into the lemmitize function, you can lemmatize each token with the POS tag you specify.

    lemmatize(token, pos_dict.get(token, 'n')) passes 'n' as the second argument by default, unless the token is a key in pos_dict. You can change this default to whatever you want, for example 'v' to keep the verb behavior from the question.

    def lemmitize(document, pos_dict):
        # Use the POS tag from pos_dict if the token has one, otherwise 'n'.
        return stemmer.stem(WordNetLemmatizer().lemmatize(document, pos_dict.get(document, 'n')))

    def preprocess(document, pos_dict):
        output = []
        for token in gensim.utils.simple_preprocess(document):
            # Skip stopwords and tokens of three characters or fewer.
            if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
                print("lemmitize: ", lemmitize(token, pos_dict))
                output.append(lemmitize(token, pos_dict))
        return output
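
The question also asked for a list of words that should not be lemmatized at all. One possible way to combine that with pos_dict is sketched below. This is only a sketch under assumptions: make_lemmatizer, pos_overrides, protected, default_pos and toy_lemmatize are hypothetical names that are not part of NLTK, and toy_lemmatize is a tiny stand-in lookup table so the example runs without the WordNet data installed.

```python
from typing import Callable, Dict, Iterable


def make_lemmatizer(
    lemmatize: Callable[[str, str], str],
    pos_overrides: Dict[str, str],
    protected: Iterable[str] = (),
    default_pos: str = "v",
) -> Callable[[str], str]:
    """Wrap a (word, pos) -> lemma callable with per-word rules.

    Hypothetical helper for illustration; not an NLTK function.
    """
    protected_set = set(protected)

    def lemmatize_token(token: str) -> str:
        # Protected words are returned unchanged, bypassing the lemmatizer.
        if token in protected_set:
            return token
        # Otherwise use the word's own POS tag, falling back to default_pos.
        return lemmatize(token, pos_overrides.get(token, default_pos))

    return lemmatize_token


def toy_lemmatize(word: str, pos: str) -> str:
    """Toy stand-in for a real lemmatizer (assumption: a small lookup table)."""
    table = {
        ("left", "v"): "leave",
        ("closed", "v"): "close",
        ("taken", "v"): "take",
    }
    return table.get((word, pos), word)


lemmatize_token = make_lemmatizer(
    toy_lemmatize,
    pos_overrides={"left": "a"},  # treat "left" as an adjective
    protected={"night"},          # never lemmatize these words
)
print([lemmatize_token(t) for t in ["left", "door", "closed", "night", "taken"]])
```

With NLTK, WordNetLemmatizer().lemmatize could be passed as the lemmatize argument, since it accepts a word and a pos. The Porter stemmer from the question would then be applied to the result, and you would have to decide whether protected words should skip stemming as well.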