
Stemming words in R does not work as expected


I am trying to do some very simple word stemming in R and am getting a very unexpected result. In the code below, the 'complete' variable is 'NA'. Why can't I complete the stem of the word "easy"?

library(tm) 
library(SnowballC)
dict <- c("easy")
stem <- stemDocument(dict, language = "english")
complete <- stemCompletion(stem, dictionary=dict)

Thank You!


Solution

  • You can see the internals of the stemCompletion() function with tm:::stemCompletion.

    function (x, dictionary, type = c("prevalent", "first", "longest", "none", "random", "shortest")){
    if(inherits(dictionary, "Corpus")) 
      dictionary <- unique(unlist(lapply(dictionary, words)))
    type <- match.arg(type)
    possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s",w), dictionary, value = TRUE))
    switch(type, first = {
      setNames(sapply(possibleCompletions, "[", 1), x)
    }, longest = {
      ordering <- lapply(possibleCompletions, function(x) order(nchar(x), 
          decreasing = TRUE))
      possibleCompletions <- mapply(function(x, id) x[id], 
          possibleCompletions, ordering, SIMPLIFY = FALSE)
      setNames(sapply(possibleCompletions, "[", 1), x)
    }, none = {
      setNames(x, x)
    }, prevalent = {
      possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), 
          decreasing = TRUE))
      n <- names(sapply(possibleCompletions, "[", 1))
      setNames(if (length(n)) n else rep(NA, length(x)), x)
    }, random = {
      setNames(sapply(possibleCompletions, function(x) {
          if (length(x)) sample(x, 1) else NA
      }), x)
    }, shortest = {
      ordering <- lapply(possibleCompletions, function(x) order(nchar(x)))
      possibleCompletions <- mapply(function(x, id) x[id], 
          possibleCompletions, ordering, SIMPLIFY = FALSE)
      setNames(sapply(possibleCompletions, "[", 1), x)
    })
    }

    The x argument holds your stemmed terms, and dictionary holds the unstemmed words. The only line that matters here is the fifth line of the function; it does a simple regex match, looking for dictionary terms that begin with the stemmed word.

    possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s",w), dictionary, value = TRUE))
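
    To see what that line does in isolation, the same pattern can be run by hand in base R, without loading tm. This is only a minimal sketch: the helper name find_completions and the extra word "easily" are made up for illustration and are not part of tm or of the question.

    ```r
    # Hypothetical helper (not part of tm) wrapping the same grep() call:
    # "^" anchors the stem to the start of each dictionary word.
    find_completions <- function(stem, dictionary) {
      grep(sprintf("^%s", stem), dictionary, value = TRUE)
    }

    # "easy" does not start with "easi", so nothing is returned
    no_match <- find_completions("easi", c("easy"))

    # "easily" (an illustrative word, not from the question) does start with "easi"
    with_match <- find_completions("easi", c("easy", "easily"))

    # Taking the first element of an empty result gives NA
    first_of_empty <- no_match[1]

    print(no_match)
    print(with_match)
    print(first_of_empty)
    ```

    An empty result from grep() is what later turns into NA once the function tries to pick a completion from it.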
    

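    The completion therefore fails: the stem of "easy" is "easi", and no word in the dictionary begins with "easi" ("easy" itself does not), so there is nothing to complete to and the result is NA. If you also have the word "easiest" in your dictionary, then both stems find a match, since there is now a dictionary word that begins with the same four letters.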

    library(tm) 
    library(SnowballC)
    dict <- c("easy","easiest")
    stem <- stemDocument(dict, language = "english")
    complete <- stemCompletion(stem, dictionary=dict)
    complete
         easi   easiest 
    "easiest" "easiest"
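
    If adding more words to the dictionary is not an option, one possible workaround is to fall back to the stem itself whenever completion returns NA. The sketch below is an assumption about how that could be done, not something tm provides: keep_stem_if_na is a hypothetical helper, and the input is a hand-built stand-in for the NA result described in the question, so that the snippet runs without tm.

    ```r
    # Hypothetical helper (not part of tm): wherever a completion is NA,
    # use the stem, which stemCompletion() stores as the element name.
    keep_stem_if_na <- function(completed) {
      missing <- is.na(completed)
      completed[missing] <- names(completed)[missing]
      completed
    }

    # Stand-in for the NA result from the question; with tm loaded this
    # would come from stemCompletion(stem, dictionary = dict).
    completed <- c(easi = NA_character_)

    result <- keep_stem_if_na(completed)
    print(result)
    ```

    This keeps the stemmed form "easi" rather than recovering "easy", so it only avoids the NA; it does not undo the stemming.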