Tags: r, keyword, wordnet

Issue with word stemming and the wordnet package in R


For keyword extraction, I need to remove synonyms. If I do not stem the words, WordNet cannot generate synonyms for forms like "year's" or "cats". If I do stem them, a word like "administer" becomes "adminste", which WordNet does not recognize. Is there a solution?


Solution

  • You may want to try Lemmatization instead of Stemming, which will give you word forms that are more likely to be found in WordNet.

    Taken from nlp.stanford.edu

    Stemming usually refers to a crude heuristic process that chops off 
    the ends of words in the hope of achieving this goal correctly most
    of the time, and often includes the removal of derivational affixes. 
    Lemmatization usually refers to doing things properly with the use 
    of a vocabulary and morphological analysis of words, normally aiming 
    to remove inflectional endings only and to return the base or 
    dictionary form of a word
    

    This is because WordNet is indexed by canonical word forms, i.e. the forms you would find as dictionary headwords, and those are exactly what lemmatization tries to produce.
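
    A toy illustration of why neither raw nor stemmed forms match, in base R. The headword list and the suffix rule are made up for this example: `toy_headwords` only stands in for WordNet's index, and `crude_stem` is not the Porter algorithm.

    ```r
    # Hypothetical stand-in for the dictionary forms WordNet is indexed by.
    toy_headwords <- c("year", "cat", "administer")

    # Crude stemmer: chop a few common endings off, whatever the word is.
    crude_stem <- function(words) sub("('s|er|s)$", "", words)

    raw <- c("year's", "cats", "administer")

    # Raw inflected forms miss the dictionary.
    raw %in% toy_headwords              # FALSE FALSE TRUE

    # Stemming fixes the first two but over-strips the third ("administ").
    crude_stem(raw) %in% toy_headwords  # TRUE TRUE FALSE
    ```

    A lemmatizer avoids both failures because it maps each word to a form that is itself a dictionary entry.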

    If you do not give WordNet word forms it can use (for example, by not tokenizing the text first), you cannot get its full benefit.

    I would suggest building a simple pipeline:

    1. Tokenize
    2. Lemmatize
    3. Keyword Extraction (WordNet)
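
    The three steps above can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not a finished implementation: `lemma_table` and `toy_synsets` are tiny hypothetical lookup tables standing in for a real lemma dictionary and for WordNet, and all function names are invented for the example.

    ```r
    # Step 1: tokenize. Lower-case the text and split on anything that is
    # not a letter or an apostrophe.
    tokenize <- function(text) {
      tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
      tokens[nzchar(tokens)]
    }

    # Step 2: lemmatize with a lookup table (hypothetical entries).
    # Unknown tokens are passed through unchanged.
    lemma_table <- c("year's" = "year", "cats" = "cat", "felines" = "feline")
    lemmatize <- function(tokens) {
      hit <- lemma_table[tokens]
      unname(ifelse(is.na(hit), tokens, hit))
    }

    # Step 3: drop synonyms. `lookup_synonyms` is a stand-in for a WordNet
    # query: it takes one lemma and returns a character vector of synonyms.
    toy_synsets <- list(cat = "feline", feline = "cat")
    lookup_synonyms <- function(lemma) {
      syn <- toy_synsets[[lemma]]
      if (is.null(syn)) character(0) else syn
    }

    drop_synonyms <- function(lemmas, synonyms_of = lookup_synonyms) {
      kept <- character(0)
      for (w in unique(lemmas)) {
        # Keep a lemma only if no keyword kept so far is one of its synonyms.
        if (!any(synonyms_of(w) %in% kept)) kept <- c(kept, w)
      }
      kept
    }

    keywords <- drop_synonyms(lemmatize(tokenize("Cats, felines, year's report")))
    keywords  # "cat" "year" "report"
    ```

    In a real pipeline the two lookup tables would be replaced by actual resources. One possibility, assuming the `textstem` and `wordnet` packages and a local WordNet dictionary are installed, is to use `textstem::lemmatize_words()` for step 2 and to pass a wrapper around `wordnet::synonyms(word, pos)` as `synonyms_of` in step 3.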