Tags: java, python, nlp, stemming, lemmatization

How to use a Stemmer or Lemmatizer to stem a specific word


I am currently trying to stem a big corpus (approx. 800k sentences). So far I've only managed basic stemming. The problem is that I want to stem only specific words: my method only applies if the lemma is a substring of the original word. For instance, the word 'apples' splits into the lemma 'apple' plus the suffix 's'. But if the lemma is not a substring, the word does not get split, e.g. 'teeth' is not reduced to 'tooth'.

I've also read about the WordNet lemmatizer, where we can pass a pos parameter such as verb, noun, or adjective. Is there a way I can apply this to the method above?
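For reference, here is a minimal sketch of the pos parameter I'm referring to (the example words are just for illustration):

    from nltk.stem import WordNetLemmatizer

    lmtzr = WordNetLemmatizer()

    # With the default pos='n' (noun), a verb form is left untouched.
    print(lmtzr.lemmatize('running'))           # 'running'

    # Passing pos='v' makes WordNet look the word up as a verb.
    print(lmtzr.lemmatize('running', pos='v'))  # 'run'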

Thanks in advance!


Solution

  • A complete example:

    import nltk
    from nltk.corpus import wordnet
    from difflib import get_close_matches as gcm
    from itertools import chain
    from nltk.stem.porter import PorterStemmer
    
    texts = [ " apples are good. My teeth will fall out.",
              " roses are red. cars are great to have"]
    
    lmtzr = nltk.WordNetLemmatizer()
    stemmer = PorterStemmer()
    
    for text in texts:
        tokens = nltk.word_tokenize(text) # ideally, sentence-tokenize the text first, then word-tokenize each sentence
        # Take your pick here between the WordNet lemmatizer and the wordnet synsets.
        token_lemma = [ lmtzr.lemmatize(token) for token in tokens ]
        # Synset-based alternative: gather every lemma name WordNet knows for the token
        # and keep the closest string matches.
        wn_lemma = [ gcm(word, list(set(chain(*[i.lemma_names() for i in wordnet.synsets(word)]))))
                     for word in tokens ]
        #print(wn_lemma) # works for unconventional words like 'teeth' --> 'tooth'. You might want to take a closer look
        # Stem the token if its lemma is shorter than the token itself; otherwise keep the lemma.
        tokens_final = [ stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i]
                         for i in range(len(tokens)) ]
        print(tokens_final)
    

    Output

    ['appl', 'are', 'good', '.', 'My', 'teeth', 'will', 'fall', 'out', '.']
    ['rose', 'are', 'red', '.', 'car', 'are', 'great', 'to', 'have']
    

    Explanation

    Notice the expression stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i]: this is where the magic happens. If the lemmatized word is shorter than the original word (i.e. the lemma is effectively a subset of it, as with 'apple' in 'apples'), the word gets stemmed; otherwise it just remains lemmatized.
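    If you would rather make that condition an explicit substring check (which is literally what the question asks for) instead of a length comparison, a small variant could look like the sketch below; stem_or_lemma is just a made-up helper name:

        from nltk.stem import WordNetLemmatizer
        from nltk.stem.porter import PorterStemmer

        lmtzr = WordNetLemmatizer()
        stemmer = PorterStemmer()

        def stem_or_lemma(token):
            # Stem only when the lemma is literally contained in the original token
            # (e.g. 'apple' in 'apples'); otherwise fall back to the lemma itself.
            lemma = lmtzr.lemmatize(token)
            if lemma != token and lemma in token:
                return stemmer.stem(token)
            return lemma

        print([stem_or_lemma(t) for t in ['apples', 'cars', 'teeth']])
        # exact results for irregular forms like 'teeth' depend on your WordNet data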

    Note

    The lemmatization that you are attempting has some edge cases. The WordNetLemmatizer is not smart enough to handle exceptional cases like 'teeth' --> 'tooth'. In those cases you would want to take a look at wordnet.synsets, which might come in handy.

    I have included a small case in the comments for your investigation.
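
    To play with that commented-out piece on its own, a rough sketch of the synset-based lookup could look like this (the exact candidates depend on your installed WordNet data):

        from itertools import chain
        from difflib import get_close_matches
        from nltk.corpus import wordnet

        def wordnet_candidates(word):
            # Collect every lemma name from every synset WordNet returns for the word,
            # then keep only the candidates that are close string matches to it.
            lemma_names = set(chain(*[s.lemma_names() for s in wordnet.synsets(word)]))
            return get_close_matches(word, list(lemma_names))

        print(wordnet_candidates('teeth'))  # as noted above, 'tooth' should appear among the candidates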