Sample data
The dput() output of my data:
x <- structure(list(Comments = structure(2:1, .Label = c("I have a lot of home-work to be completed..",
"I want to vist my teacher today only!!"), class = "factor"),
Comment_ID = c(704, 802)), class = "data.frame", row.names = c(NA,
-2L))
I want to remove the stop words from the above data set using tidytext::stop_words$word while retaining the same columns in the output. Along with this, how can I remove punctuation using the tidytext package?
Note: I don't want to convert my dataset into a corpus.
You can collapse all the words in tidytext::stop_words$word into a single regex, adding word boundaries around each word. However, tidytext::stop_words$word has 1149 entries, which might be too large for one regex to handle, so you could drop the words you don't need before applying this.
For example, taking only the first 10 words from tidytext::stop_words$word, you can do:
gsub(paste0(paste0('\\b', tidytext::stop_words$word[1:10], '\\b',
collapse = "|"), '|[[:punct:]]+'), '', x$Comments)
#[1] "I want to vist my teacher today only"
#    "I have  lot of homework to be completed"
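To retain both columns, one option is to write the cleaned text back into the Comments column instead of only printing the vector. The following is a sketch under assumptions: stop_list is a hypothetical four-word list standing in for tidytext::stop_words$word (or whatever subset of it you keep), and the whitespace clean-up is an added step, since deleting a word leaves its surrounding spaces behind.

```r
# Sample data from the question (Comments is a factor, as in the dput output).
x <- data.frame(
  Comments = factor(c("I want to vist my teacher today only!!",
                      "I have a lot of home-work to be completed..")),
  Comment_ID = c(704, 802)
)

# Hypothetical short stop-word list for illustration only; in practice this
# could be tidytext::stop_words$word or a subset of it.
stop_list <- c("a", "to", "of", "be")

# One alternation of stop words between word boundaries, plus runs of punctuation.
pattern <- paste0("\\b(", paste(stop_list, collapse = "|"), ")\\b|[[:punct:]]+")

# Convert the factor to character, remove matches, then tidy leftover spaces.
cleaned <- gsub(pattern, "", as.character(x$Comments))
cleaned <- trimws(gsub("\\s+", " ", cleaned))

# Overwrite the column in place so the data frame keeps both columns.
x$Comments <- cleaned
print(x)
```

Comment_ID is left untouched; Comments becomes a character column rather than a factor. Wrap the result in factor() if you need the original type back.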