I use the following approach to remove stop words from text:
library("quanteda")

dfm <-
  tokens(df$text,
         remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords(source = "smart")) %>%
  tokens_wordstem()
However, the result still contains stop words such as this one:
dont
Is there a way to remove them without using a customized list of stop words?
When you say “remove them”, I am assuming that you mean remove dont from your tokens, whereas the existing stopwords list only removes don’t. (This was not entirely clear from your question or from how some of the answers have interpreted it.) Two simple solutions exist within the quanteda framework.

First, you can append additional removal patterns to the tokens_remove() call.

Second, you could process the character vector returned by stopwords() so that it also includes the versions without apostrophes.
Illustration:
library("quanteda")
## Package version: 1.5.1
toks <- tokens("I don't know what I dont or cant know.")
# original
tokens_remove(toks, c(stopwords("en")))
## tokens from 1 document.
## text1 :
## [1] "know" "dont" "cant" "know" "."
# manual addition
tokens_remove(toks, c(stopwords("en"), "dont", "cant"))
## tokens from 1 document.
## text1 :
## [1] "know" "know" "."
# automatic addition to stopwords
tokens_remove(toks, c(
  stopwords("en"),
  stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
## tokens from 1 document.
## text1 :
## [1] "know" "know" "."