I have a stopword list that I would like to use to remove specific phrases from a text:
#dummy text
df2 <- c("hi my name is Ann and code code all the time! However not after that I would like")
mystopwords <- c("hi", "code code", "not after that")
I use this option:
myDfm <- df2 %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_remove(pattern = c(stopwords(source = "smart"), mystopwords)) %>%
tokens_wordstem() %>%
tokens_ngrams(n = c(1, 3)) %>%
dfm()
but when I check the frequencies of the bigrams or trigrams, the phrases have not been removed, only stemmed.
Is there anything wrong in the syntax?
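You can achieve that by wrapping your list of stop-phrases in the phrase() function. Without it, each multi-word pattern such as "code code" is compared against single tokens, so it never matches.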
It works like this:
library(quanteda)
df2 <- c("hi my name is Ann and code code all the time! However not after that I would like")
mystopwords <- c("hi", "code code", "not after that")
df2 %>% tokens %>%
tokens_remove(pattern = phrase(mystopwords), valuetype = 'fixed')
## tokens from 1 document.
## text1 :
## [1] "my" "name" "is" "Ann" "and" "all" "the" "time" "!" "However" "I" "would"
## [13] "like"
You can find detailed information on working with multi-word expressions in quanteda here: https://quanteda.io/articles/pkgdown/examples/phrase.html
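To tie this back to the pipeline in the question, here is a minimal sketch of how phrase() might be slotted in. It assumes the stop-phrases should be removed before the single-word SMART stopwords and before stemming, so that the words of each phrase are still adjacent and unstemmed when they are matched. The intermediate names `toks` and `no_phrase` are only illustrative, and plain assignments are used instead of `%>%` so the sketch does not depend on a pipe being available.

```r
library(quanteda)

df2 <- c("hi my name is Ann and code code all the time! However not after that I would like")
mystopwords <- c("hi", "code code", "not after that")

toks <- tokens(df2, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

# For comparison: without phrase(), "code code" is treated as one single-token
# pattern, so the individual "code" tokens are left in place.
no_phrase <- tokens_remove(toks, pattern = mystopwords, valuetype = "fixed")

# With phrase(), each multi-word entry is matched as a sequence of tokens.
toks <- tokens_remove(toks, pattern = phrase(mystopwords), valuetype = "fixed")

# Single-word stopwords, stemming and n-grams follow as in the question.
toks <- tokens_remove(toks, pattern = stopwords(source = "smart"))
toks <- tokens_wordstem(toks)
toks <- tokens_ngrams(toks, n = c(1, 3))
myDfm <- dfm(toks)

featnames(myDfm)
```

Note that n = c(1, 3) asks for unigrams and trigrams only; if bigrams are wanted as well, n = 1:3 would be the setting to try.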