I am currently trying to stem a big corpus (approx. 800k sentences). So far I have only managed basic stemming. The problem now is that I want to stem only specific words: the method should apply only if the lemma is a substring of the original word. For instance, the word apples splits into apple and the suffix 's'. But if the lemma is not a substring, the word should not be split, as with teeth, whose lemma is tooth.
I've also read about the WordNet lemmatizer, where we can pass a pos parameter such as verb, noun or adjective. Is there a way I can apply the method above with it?
Thanks in advance!
Here is a complete example:
import nltk
from nltk.corpus import wordnet
from difflib import get_close_matches as gcm
from itertools import chain
from nltk.stem.porter import PorterStemmer

# Requires the NLTK tokenizer and WordNet data to be downloaded (see nltk.download).
texts = [" apples are good. My teeth will fall out.",
         " roses are red. cars are great to have"]
lmtzr = nltk.WordNetLemmatizer()
stemmer = PorterStemmer()
for text in texts:
    tokens = nltk.word_tokenize(text)  # should sent tokenize it first
    # take your pick here between lemmatizer and wordnet synset.
    token_lemma = [lmtzr.lemmatize(token) for token in tokens]
    # closest WordNet lemma names for each token, collected over all of its synsets
    wn_lemma = [gcm(word, list(set(chain(*[i.lemma_names() for i in wordnet.synsets(word)])))) for word in tokens]
    #print(wn_lemma) # works for unconventional words like 'teeth' --> tooth. You might want to take a closer look
    # stem the token only when it is longer than its lemma; otherwise keep the lemma
    tokens_final = [stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i] for i in range(len(tokens))]
    print(tokens_final)
Output
['appl', 'are', 'good', '.', 'My', 'teeth', 'will', 'fall', 'out', '.']
['rose', 'are', 'red', '.', 'car', 'are', 'great', 'to', 'have']
Explanation
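Notice stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i]
this is where the magic happens. If the token is longer than its lemma (as with 'apples' and 'apple'), the token gets stemmed; otherwise it just remains lemmatized. The length comparison stands in for the substring condition from the question; it is not a true substring test.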
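If an explicit check is preferred over the length comparison, something along these lines might work. This is only a sketch: split_on_lemma is a hypothetical helper (not part of NLTK), it assumes the lemma has already been obtained (for example from lmtzr.lemmatize), and it treats "substring" as "prefix", which is an assumption about what the question intends.

```python
def split_on_lemma(word, lemma):
    """Return (lemma, suffix) when the lemma is a proper prefix of the word.

    Hypothetical helper for illustration; otherwise the word is returned
    unsplit with an empty suffix.
    """
    if lemma and word != lemma and word.startswith(lemma):
        return lemma, word[len(lemma):]
    return word, ""


print(split_on_lemma("apples", "apple"))  # ('apple', 's')
print(split_on_lemma("teeth", "tooth"))   # ('teeth', '')
```

With this, 'apples' is split into its lemma and suffix, while 'teeth' is left alone because 'tooth' does not occur at the start of it.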
Note
The lemmatization that you are attempting has some edge cases. As the output above shows, WordNetLemmatizer did not handle the exceptional case 'teeth' --> 'tooth' here. In those cases you would want to take a look at wordnet.synsets, which might come in handy.
I have included a small case in the comments (the wn_lemma line) for your investigation.
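On the pos question: WordNetLemmatizer.lemmatize takes a pos argument such as 'n', 'v', 'a' or 'r' (noun by default), whereas nltk.pos_tag returns Penn Treebank tags by default, so some mapping between the two is needed. A minimal sketch of such a mapping follows; penn_to_wordnet is a hypothetical helper, and falling back to noun for every other tag is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # assumed fallback: treat everything else as a noun


for tag in ("NNS", "VBD", "JJ", "RB", "."):
    print(tag, penn_to_wordnet(tag))
```

The result could then be passed along as lmtzr.lemmatize(token, pos=penn_to_wordnet(tag)) for each (token, tag) pair from nltk.pos_tag(tokens), with the stem-or-lemma check applied afterwards as before.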