
Stemming by `hunspell` dictionary


From "Stemming Words" I have taken the following custom stemming function:

stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]

  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }

  stem
}

It uses the hunspell dictionary to do the stemming (package corpus).

I tried this function on the following sentences.

sentences<-c("We're taking proactive steps to tackle ...",                     
             "A number of measures we are taking to support ...",            
             "We caught him committing an indecent act.")

And then I performed the following operations:

library(qdap)
library(tm)

sentences <- iconv(sentences, "latin1", "ASCII", sub="")

sentences <- gsub('http\\S+\\s*', '', sentences)

sentences <- bracketX(sentences,bracket='all')
sentences <- gsub("[[:punct:]]", "",sentences)

sentences <- removeNumbers(sentences)
sentences <- tolower(sentences)

# Stemming
library(corpus)

stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]

  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }

  stem
}

sentences=text_tokens(sentences, stemmer = stem_hunspell)

sentences = lapply(sentences, removeWords, stopwords('en'))
sentences = lapply(sentences, stripWhitespace)

I cannot explain the results:

[[1]]
[1] ""       "taking" "active" "step"   ""       "tackle"

[[2]]
[1] ""        "numb"    ""        "measure" ""        ""        "taking"  ""       
[9] "support"

[[3]]
[1] ""           "caught"     ""           "committing" ""           "decent"    
[7] "act"  

For example, why do "commit" and "take" appear in their ing-form? Why did "number" become "numb"?


Solution

  • I think the answer is mostly that this is just how hunspell stems these words. We can check this with a simpler example:

    hunspell::hunspell_stem("taking")
    #> [[1]]
    #> [1] "taking"
    hunspell::hunspell_stem("committing")
    #> [[1]]
    #> [1] "committing"
    

    The ing-form is the only option offered by hunspell. To me this doesn't make much sense either, and my suggestion would be to use a different stemmer. And while we're at it, I think you would also benefit from switching to quanteda instead of tm:

    library(quanteda)
    sentences <- c("We're taking proactive steps to tackle ...",                     
                   "A number of measures we are taking to support ...",            
                   "We caught him committing an indecent act.")
    
    tokens(sentences, remove_numbers = TRUE) %>% 
      tokens_tolower() %>% 
      tokens_wordstem()
    #> Tokens consisting of 3 documents.
    #> text1 :
    #> [1] "we'r"     "take"     "proactiv" "step"     "to"       "tackl"    "."       
    #> [8] "."        "."       
    #> 
    #> text2 :
    #>  [1] "a"       "number"  "of"      "measur"  "we"      "are"     "take"   
    #>  [8] "to"      "support" "."       "."       "."      
    #> 
    #> text3 :
    #> [1] "we"     "caught" "him"    "commit" "an"     "indec"  "act"    "."
    

    The workflow is a lot cleaner in my opinion and the results make a bit more sense to me. quanteda uses the SnowballC package to do the stemming here, which you could integrate into your tm workflow if you wanted. A tokens object holds the texts in the same order as the input object, but tokenised (i.e., split into words).
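
    As a rough sketch of that last point, the same stemmer can be called directly through `SnowballC::wordStem()`, without going through quanteda. The `words` vector here is made up for illustration; in a tm workflow you would pass your own tokens instead:

    ```r
    library(SnowballC)

    # a few words from the example sentences (illustrative input only)
    words <- c("taking", "committing", "number", "measures")

    # wordStem() applies the Snowball stemmer for the given language
    # to each element of a character vector
    stems <- wordStem(words, language = "english")
    stems
    ```

    Since this is the stemmer behind `tokens_wordstem()`, the stems should match the ones in the quanteda output above.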

    If you still want to use hunspell, you could do so with the following function, which fixes some of the problems you seem to have. It takes the first stem hunspell offers rather than the last one, so "number" is now correct:

    stem_hunspell <- function(toks) {
    
      # look up the term in the dictionary
      stems <- vapply(hunspell::hunspell_stem(types(toks)), "[", 1, FUN.VALUE = character(1))
    
      # if there are no stems, use the original term
      stems[nchar(stems) == 0] <- types(toks)[nchar(stems) == 0]
    
      tokens_replace(toks, types(toks), stems, valuetype = "fixed")
    
    }
    
    tokens(sentences, remove_numbers = TRUE) %>% 
      tokens_tolower() %>%
      stem_hunspell()
    #> Tokens consisting of 3 documents.
    #> text1 :
    #> [1] "we're"  "taking" "active" "step"   "to"     "tackle" "."      "."     
    #> [9] "."     
    #> 
    #> text2 :
    #>  [1] "a"       "number"  "of"      "measure" "we"      "are"     "taking" 
    #>  [8] "to"      "support" "."       "."       "."      
    #> 
    #> text3 :
    #> [1] "we"         "caught"     "him"        "committing" "an"        
    #> [6] "decent"     "act"        "."
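
    To see where a result like "numb" comes from, it can help to list every candidate hunspell offers for a word and compare what "take the first" and "take the last" would pick. This is only a sketch: it assumes the default en_US dictionary that ships with the hunspell package, and the candidates depend on the dictionary version, so no output is shown:

    ```r
    library(hunspell)

    words <- c("number", "proactive", "indecent")

    # hunspell_stem() returns a list with one character vector
    # of candidate stems per input word
    candidates <- hunspell_stem(words)
    names(candidates) <- words

    # pick the first and the last candidate for each word;
    # NA means hunspell returned no candidate at all
    first_stem <- vapply(candidates, function(s) s[1], character(1))
    last_stem <- vapply(candidates, function(s) s[length(s)][1], character(1))

    data.frame(word = words, first = first_stem, last = last_stem)
    ```

    Where the two columns differ, the choice between the first and the last candidate changes the result.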