I am applying the WordNet lemmatizer to my corpus, and I need to define the POS tag for the lemmatizer:
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()

def lemmitize(document):
    return stemmer.stem(WordNetLemmatizer().lemmatize(document, pos='v'))

def preprocess(document):
    output = []
    for token in gensim.utils.simple_preprocess(document):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            print("lemmitize: ", lemmitize(token))
            output.append(lemmitize(token))
    return output
As you can see, I am setting pos to verb (and I know the WordNet default POS is noun). However, when I lemmatize my document:
the left door closed at the night
I am getting the output:
output: ['leav', 'door', 'close', 'night']
which is not what I was expecting. In the sentence above, left indicates which door it is (e.g. right or left). If I choose pos='n', this particular problem may be solved, but then the lemmatizer acts as the WordNet default and has no effect on words like taken.
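For example, lemmatizing these words in isolation (without the stemming step) shows the trade-off:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('left', pos='v'))   # 'leave' -- wrong for my sentence
print(wnl.lemmatize('left', pos='n'))   # 'left'
print(wnl.lemmatize('taken', pos='v'))  # 'take'
print(wnl.lemmatize('taken', pos='n'))  # 'taken' -- the noun default leaves it untouched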
I found a similar issue here, so I modified the exception list in nltk_data/corpora/wordnet/verb.exc and changed left leave to left left, but I am still getting the same result, leav.
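A quick way to check whether the edited verb.exc is actually the file being consulted (the lemmatizer is built on WordNet's morphy lookup, and the exception list is loaded once per session from whichever copy of the corpus NLTK finds first on nltk.data.path):

from nltk.corpus import wordnet as wn

# with the stock exception list this prints 'leave';
# if it still does after editing verb.exc and restarting Python,
# the edited copy is not the one being loaded
print(wn.morphy('left', wn.VERB))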
Now I am wondering whether there is any solution to this problem or, in the best case, whether there is a way to add a custom dictionary of words (limited to my document) that WordNet should not lemmatize, like:

my_dict_list = ['left', ...]
You can add a custom dictionary for certain words, like pos_dict = {'breakfasted':'v', 'left':'a', 'taken':'v'}.

By passing this customized pos_dict along with token into the lemmitize function, you can use the lemmatizer for each token with a POS tag that you specify. lemmatize(token, pos_dict.get(token, 'n')) will pass 'n' for its second argument as a default value, unless the token is in the pos_dict keys. You can change this default value to whatever you want.
def lemmitize(document, pos_dict):
    return stemmer.stem(WordNetLemmatizer().lemmatize(document, pos_dict.get(document, 'n')))

def preprocess(document, pos_dict):
    output = []
    for token in gensim.utils.simple_preprocess(document):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            print("lemmitize: ", lemmitize(token, pos_dict))
            output.append(lemmitize(token, pos_dict))
    return output
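For example, with the sentence from the question and the pos_dict above (a quick sketch, assuming the same imports and stemmer as in the original code), you should get something like:

pos_dict = {'breakfasted': 'v', 'left': 'a', 'taken': 'v'}
print(preprocess("the left door closed at the night", pos_dict))
# ['left', 'door', 'close', 'night']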