
R function doesn't loop through column but repeats first row result


I am trying to use the stemming function suggested in the corpus package's stemming vignette: https://cran.r-project.org/web/packages/corpus/vignettes/stemmer.html

However, when I run the function on an entire column, it just repeats the result for the first row down the rest of the rows. I suspect this is caused by the [[1]] in the function below. The fix is probably something along the lines of "for i in x", but I'm not familiar enough with writing functions to know how to do it.

df <- data.frame(x = 1:7, y= c("love", "lover", "lovely", "base", "snoop", "dawg", "pound"), stringsAsFactors=FALSE)

stem_hunspell <- function(term) {
    # look up the term in the dictionary
    stems <- hunspell::hunspell_stem(term)[[1]]

    if (length(stems) == 0) { # if there are no stems, use the original term
        stem <- term
    } else { # if there are multiple stems, use the last one
        stem <- stems[[length(stems)]]
    }

    stem
}

df[3] <- stem_hunspell(df$y)


Solution

  • Your intuition is right.

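    hunspell_stem(term) returns a list with one character vector per element of term, so the list has length(term) entries. The [[1]] in your function keeps only the entry for the first term, and assigning that single result to the column recycles it down every row.

    Each vector seems to contain the word itself as the first element (but only if the word was found in the dictionary), followed by the stem as the second element if the word is not already a stem.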

    > hunspell::hunspell_stem(df$y)
    [[1]]
    [1] "love"
    
    [[2]]
    [1] "lover" "love" 
    
    [[3]]
    [1] "lovely" "love"  
    
    [[4]]
    [1] "base"
    
    [[5]]
    [1] "snoop"
    
    [[6]]
    character(0)
    
    [[7]]
    [1] "pound"
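
To make the recycling concrete, here is a minimal sketch in base R. It assumes a hand-built list, `stems`, copied from the output above as a stand-in for `hunspell::hunspell_stem(df$y)`, so it runs without the hunspell package; the names `first_only` and `demo` are only illustrative.

```r
# Stand-in for hunspell::hunspell_stem(df$y), copied from the output above
stems <- list("love", c("lover", "love"), c("lovely", "love"),
              "base", "snoop", character(0), "pound")

# [[1]] extracts only the first term's stems, discarding the other six
first_only <- stems[[1]]

# Assigning a length-1 value to a data frame column recycles it to every row
demo <- data.frame(x = 1:7)
demo$stem <- first_only

unique(demo$stem)
```

Because `first_only` has length 1, every row of `demo$stem` ends up with the same value, which matches the behaviour described in the question.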
    

    The function below loops over every term and returns its last stem, or the original term if no stem was found:

    stem_hunspell <- function(term) {
      # one character vector of stems per term
      stems <- hunspell::hunspell_stem(term)
      output <- character(length(term))
    
      for (i in seq_along(term)) {
        stem <- stems[[i]]
        if (length(stem) == 0) {
          # no stems found: keep the original term
          output[i] <- term[i]
        } else {
          # several stems: use the last one
          output[i] <- stem[length(stem)]
        }
      }
      return(output)
    }
    

    If you do not want unknown words such as dawg to be returned, the function becomes simpler. It leaves an empty string in their place:

    stem_hunspell <- function(term) {
      stems <- hunspell::hunspell_stem(term)
      output <- character(length(term))
    
      for (i in seq_along(term)) {
        stem <- stems[[i]]
        if (length(stem) > 0) {
          output[i] <- stem[length(stem)]
        }
      }
      return(output)
    }
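
The same per-term logic can also be written with `vapply` instead of an explicit loop. The following is only a sketch: `last_stem` and its `keep_unknown` argument are hypothetical names, and the stems list is passed in as an argument (here a hand-copied stand-in for the `hunspell::hunspell_stem` output shown earlier) so the example runs without the hunspell package.

```r
# Hypothetical helper: pick the last stem for each term from a stems list,
# e.g. the result of hunspell::hunspell_stem(term).
last_stem <- function(term, stems, keep_unknown = TRUE) {
  # value to use when a term has no stems: the term itself or ""
  fallback <- if (keep_unknown) term else rep("", length(term))
  vapply(seq_along(term), function(i) {
    s <- stems[[i]]
    if (length(s) == 0) fallback[i] else s[length(s)]
  }, character(1))
}

terms <- c("love", "lover", "lovely", "base", "snoop", "dawg", "pound")

# Stand-in for hunspell::hunspell_stem(terms), copied from the output above
stems <- list("love", c("lover", "love"), c("lovely", "love"),
              "base", "snoop", character(0), "pound")

result <- last_stem(terms, stems)
result_blank <- last_stem(terms, stems, keep_unknown = FALSE)
```

With the real package installed, the call would presumably look like `df[3] <- last_stem(df$y, hunspell::hunspell_stem(df$y))`. `vapply` with `character(1)` guarantees exactly one string per term, which is what the column assignment needs.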