Tags: r, dplyr, text-mining, tidytext

Removing Stop words from a list of strings in R


Sample data

dput() output of my data:

    x <- structure(list(Comments = structure(2:1, .Label = c("I have a lot of home-work to be completed..",
        "I want to vist my teacher today only!!"), class = "factor"),
        Comment_ID = c(704, 802)), class = "data.frame", row.names = c(NA,
        -2L))

I want to remove the stop words from the above data set using tidytext::stop_words$word while keeping the same columns in the output. How can I also remove punctuation with the tidytext package?

Note: I don't want to convert my dataset into a corpus.


Solution

  • You can collapse all the words in tidytext::stop_words$word into a single regex, wrapping each word in word boundaries (\b). However, tidytext::stop_words$word has 1149 entries, which might be too large for the regex engine to handle, so you can first drop the words you don't need and apply the pattern to the rest.

    For example, taking only the first 10 words from tidytext::stop_words$word, you can do:

    gsub(paste0(paste0('\\b', tidytext::stop_words$word[1:10], '\\b', 
         collapse = "|"), '|[[:punct:]]+'), '', x$Comments)
    
    
    #[1] "I want to vist my teacher today only"    
    #    "I have  lot of homework to be completed"
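
To keep both columns in the output, the cleaned strings can be assigned back to x$Comments instead of being printed. The following is a minimal base R sketch of the same regex idea, under two assumptions that are not in the original answer: `stops` is a short, hypothetical stand-in for the stop-word list (swap in unique(tidytext::stop_words$word) for the real one), and Comments is stored as character rather than factor.

```r
# Sample data from the question, with Comments kept as character (assumption)
x <- data.frame(
  Comments = c("I want to vist my teacher today only!!",
               "I have a lot of home-work to be completed.."),
  Comment_ID = c(704, 802),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in stop-word list; for the real list use
# unique(tidytext::stop_words$word)
stops <- c("i", "a", "to", "my", "of", "be", "have", "want")

# Put longer words first so a short alternative cannot match part of a longer one
stops <- stops[order(-nchar(stops))]

# One alternation wrapped in a single pair of word boundaries
pattern <- paste0("\\b(?:", paste(stops, collapse = "|"), ")\\b")

# 1. remove stop words (case-insensitive), 2. remove punctuation,
# 3. collapse the leftover runs of whitespace and trim the ends
clean <- gsub(pattern, "", x$Comments, ignore.case = TRUE, perl = TRUE)
clean <- gsub("[[:punct:]]+", "", clean)
clean <- trimws(gsub("[[:space:]]+", " ", clean))

# Overwrite the column in place so Comment_ID is retained
x$Comments <- clean
x
```

Stop words are removed before punctuation so that entries containing an apostrophe (such as "a's" in the tidytext list) can still match. The result is still a two-column data frame, so no corpus conversion is involved.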