r, quanteda

Remove custom stopwords and phrases using quanteda


I have my own stopword list, which I would like to use to remove specific phrases from text:

# dummy text
df2 <- c("hi my name is Ann and code code all the time! However not after that I would like")

mystopwords <- c("hi", "code code", "not after that")

I use this option:

myDfm <- df2 %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = c(stopwords(source = "smart"), mystopwords)) %>%
  tokens_wordstem() %>%
  tokens_ngrams(n = c(1, 3)) %>%
  dfm()

but when I check the frequencies of the bigrams and trigrams, the phrases have not been removed, only stemmed.

Is there anything wrong with the syntax?


Solution

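  • You can achieve this by wrapping the list of stop-phrases in phrase(). tokens_remove() compares each pattern against individual tokens, so an entry that contains a space, such as "code code", never matches anything. phrase() splits each entry on whitespace so that it is matched as a sequence of tokens.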

    It works like this:

    library(quanteda)
    df2 <- c("hi my name is Ann and code code all the time! However not after that I would like")
    
    mystopwords <- c("hi", "code code", "not after that")
    
    df2 %>% tokens %>% 
      tokens_remove(pattern = phrase(mystopwords), valuetype = 'fixed')
    
    ## tokens from 1 document.
    ## text1 :
    ##  [1] "my"      "name"    "is"      "Ann"     "and"     "all"     "the"     "time"    "!"       "However" "I"       "would"  
    ## [13] "like"   
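
    To see why the original call leaves the phrases in place, the sketch below compares the two calls side by side. The object names toks_plain and toks_phrase are made up for this illustration; everything else comes from the question.

```r
library(quanteda)

df2 <- c("hi my name is Ann and code code all the time! However not after that I would like")
mystopwords <- c("hi", "code code", "not after that")

toks <- tokens(df2, remove_punct = TRUE)

# Without phrase(): each pattern is compared with single tokens, so only
# "hi" can match; "code code" and "not after that" are never found.
toks_plain <- tokens_remove(toks, pattern = mystopwords, valuetype = "fixed")

# With phrase(): multi-word entries are matched as token sequences.
toks_phrase <- tokens_remove(toks, pattern = phrase(mystopwords), valuetype = "fixed")

"code" %in% as.character(toks_plain)   # the phrase survived
"code" %in% as.character(toks_phrase)  # the phrase was removed
```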
    

    You can find detailed information about working with multiword expressions in quanteda here: https://quanteda.io/articles/pkgdown/examples/phrase.html
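
    Applied to the full pipeline from the question, the fix might look like the sketch below. It rests on two assumptions that are not in the original post: the custom phrases are removed in their own tokens_remove() call before the single-word SMART stopwords (otherwise "not", "after" and "that" would already be gone when the phrase is looked up), and n = 1:3 is used because the question mentions bigrams, whereas n = c(1, 3) produces only unigrams and trigrams.

```r
library(quanteda)

df2 <- c("hi my name is Ann and code code all the time! However not after that I would like")
mystopwords <- c("hi", "code code", "not after that")

toks <- tokens(df2, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)

# Step 1: remove the custom entries, matching multi-word ones as sequences.
toks <- tokens_remove(toks, pattern = phrase(mystopwords), valuetype = "fixed")

# Step 2: remove the single-word SMART stopwords in a separate call.
toks <- tokens_remove(toks, pattern = stopwords("en", source = "smart"))

# Step 3: stem, build unigrams to trigrams, and create the dfm.
toks <- tokens_wordstem(toks)
toks <- tokens_ngrams(toks, n = 1:3)
myDfm <- dfm(toks)

# Inspect the remaining features; none should contain "code".
featnames(myDfm)
```

    The steps are written as plain assignments rather than with %>% so that the sketch does not depend on a pipe operator being available after loading quanteda alone.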