Tags: r, nlp, stop-words, stemming, snowball

How to perform stemming and put the words back in the original review format?


I have a dataset with a column, full_text, that contains review text from an online website. I want to clean these reviews by removing stop words and stemming the remaining words, then put each review back into its original format: all stemmed words joined into a single string, with one row per review rather than one stemmed word per row.

I am attempting the following:

sw <- stop_words %>% filter(lexicon == "SMART")

for (j in 1:nrow(reviews_df)) {

  nostopwords <- reviews_df[j,] %>% unnest_tokens(word, full_text) %>%
                  anti_join(sw, by = "word")
  stemmed <- wordStem(nostopwords[ , "word"], language = "porter")
  
reviews_df[j, "stemmed_Description"] <- paste(stemmed, collapse = " ")

}

However, the new column stemmed_Description does not look the way I wanted. The words are not stemmed, and instead of a sentence-style string each value is the text of a character vector: c("word1", "word2", "word3").

How can I achieve a result of the style: "stemmedword1 stemmedword2 stemmedword3" ?

Current output:

full_text
1 pseudoindependence no one looking over your shoulder and youre free to use your own judgement to problem solve. they sometimes expect more than what a person can give. dont overwork yourself. the packages aint going no where!
stemmed_Description
1 c("pseudoindependence", "shoulder", "youre", "free", "judgement", "problem", "solve", "expect", "person", "give", "dont", "overwork", "packages", "ain't")

Solution

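  • An easier way is to use the functions already available in tidytext and dplyr, so no loop is needed. Tokenize, drop the stop words, stem, then group by a review identifier (id in the sample data below) and paste the words back together.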

    library(tidytext)
    library(dplyr)
    
    sw <- stop_words %>% filter(lexicon == "SMART")
    
    reviews_df %>% 
      unnest_tokens(word, full_text, drop = FALSE) %>% 
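      anti_join(sw) %>%                             # remove stop words (joins on the shared "word" column)
      mutate(word = SnowballC::wordStem(word)) %>%  # stem each remaining token
      group_by(id) %>%                              # one group per review
      summarise(stemmed_description = paste0(word, collapse = " "))  # rebuild one string per review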
    
    Joining, by = "word"
    # A tibble: 2 × 2
         id stemmed_description                                                                                  
      <int> <chr>                                                                                                
    1     1 pseudoindepend shoulder your free judgement problem solv expect person give dont overwork packag aint
    2     2 pseudoindepend shoulder your free judgement problem solv expect person give dont overwork packag aint
    

    data:

    reviews_df <- data.frame(id = 1:2,
                             full_text = c("pseudoindependence no one looking over your shoulder and youre free to use your own judgement to problem solve. they sometimes expect more than what a person can give. dont overwork yourself. the packages aint going no where!",
                                           "pseudoindependence no one looking over your shoulder and youre free to use your own judgement to problem solve. they sometimes expect more than what a person can give. dont overwork yourself. the packages aint going no where!"))
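If you would rather keep the loop, a likely cause of the c("word1", "word2", ...) output is that reviews_df is a tibble, so nostopwords[ , "word"] returns a one-column tibble rather than a character vector. wordStem() coerces its input with as.character(), which turns that tibble into a single deparsed string, and paste() then has nothing to collapse. The sketch below assumes that diagnosis; the two-row tibble is made up for illustration (shortened from the sample text), and pull() is used to extract the column as a plain vector.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Illustrative data; assumption: the original reviews_df was a tibble
reviews_df <- tibble(
  full_text = c("the packages aint going no where!",
                "dont overwork yourself.")
)

sw <- stop_words %>% filter(lexicon == "SMART")

# Create the target column up front so each row can be filled in place
reviews_df$stemmed_Description <- NA_character_

for (j in seq_len(nrow(reviews_df))) {
  nostopwords <- reviews_df[j, "full_text"] %>%
    unnest_tokens(word, full_text) %>%
    anti_join(sw, by = "word")

  # pull() returns a character vector; nostopwords[, "word"] would stay a tibble
  stemmed <- wordStem(pull(nostopwords, word), language = "porter")

  # Collapse the stemmed tokens into one space-separated string
  reviews_df$stemmed_Description[j] <- paste(stemmed, collapse = " ")
}

reviews_df$stemmed_Description
```

nostopwords$word or nostopwords[["word"]] would work in place of pull(). The grouped summarise() version above is still the simpler option, since it tokenizes every review in one pass.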