Tags: r, text-mining, quanteda

Quanteda: how to remove my own list of words


Since there is no ready-made list of Polish stopwords in quanteda, I would like to use my own. I have it in a text file as a list separated by spaces. If need be, I can also prepare a list separated by new lines.

How can I remove this long custom list of stopwords from my corpus? And how can I do that after stemming?

I have tried various approaches, converting the file to character vectors, for example:

stopwordsPL <- as.character(readtext("polish.stopwords.txt", encoding = "UTF-8"))
stopwordsPL <- read.table("polish.stopwords.txt", encoding = "UTF-8", stringsAsFactors = FALSE)
stopwordsPL <- dictionary(stopwordsPL)

I have also tried to use such word vectors in calls like

myStemMat <-
  dfm(
    mycorpus,
    remove = as.vector(stopwordsPL),
    stem = FALSE,
    remove_punct = TRUE,
    ngrams=c(1,3)
  )

dfm_trim(myStemMat, sparsity = stopwordsPL)

or

myStemMat <- dfm_remove(myStemMat,features = as.data.frame(stopwordsPL))

Nothing works: my stopwords still show up in the corpus and in the analysis. What is the proper way/syntax to apply custom stopwords?


Solution

  • Assuming your polish.stopwords.txt is a plain-text file with one stopword per line, you should be able to remove them from your corpus this way:

    stopwordsPL <- readLines("polish.stopwords.txt", encoding = "UTF-8")
    
    dfm(mycorpus,
        remove = stopwordsPL,
        stem = FALSE,
        remove_punct = TRUE,
        ngrams=c(1,3))
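
    If your stopword file really is space-separated rather than one word per line, base R's scan() splits on any whitespace, so it handles either layout. A minimal sketch, using the same file name as above:

    # scan() splits on whitespace, so it reads space- or newline-separated lists
    stopwordsPL <- scan("polish.stopwords.txt", what = "character",
                        encoding = "UTF-8", quiet = TRUE)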
    

    The solution using readtext() does not work because it reads in the entire file as a single document. To get the individual words, you would need to tokenise that document and coerce the tokens to a character vector. readLines() is probably easier.
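
    If you do want to stick with readtext(), a rough sketch of that route (assuming as.character() on a tokens object returns the individual tokens in your quanteda version):

    library(readtext)
    library(quanteda)
    # readtext() returns a one-row object whose text column holds the whole file
    stop_txt <- readtext("polish.stopwords.txt", encoding = "UTF-8")
    # tokenise that single document, then coerce the tokens to a character vector
    stopwordsPL <- as.character(tokens(stop_txt$text))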

    There is no need to create a dictionary from stopwordsPL either, since remove accepts a plain character vector. Also, there is no Polish stemmer implemented yet, I am afraid.
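
    You can confirm which stemmer languages are available (Polish is not among them):

    # list the languages supported by SnowballC's word stemmer
    SnowballC::getStemLanguages()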

    Currently (v0.9.9-65) the feature removal in dfm() does not get rid of stopwords that have already been joined into ngrams. To work around this, remove the stopwords at the tokens stage, before forming the ngrams:

    # form the tokens, removing punctuation
    mytoks <- tokens(mycorpus, remove_punct = TRUE)
    # remove the Polish stopwords, leave pads
    mytoks <- tokens_remove(mytoks, stopwordsPL, padding = TRUE)
    ## can't do this next one since no Polish stemmer in 
    ## SnowballC::getStemLanguages()
    # mytoks <- tokens_wordstem(mytoks, language = "polish")
    # form the ngrams
    mytoks <- tokens_ngrams(mytoks, n = c(1, 3))
    # construct the dfm
    dfm(mytoks)
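
    As a quick sanity check, you can confirm that none of the stopwords survive as features in the resulting dfm, using quanteda's featnames():

    myStemMat <- dfm(mytoks)
    # should be FALSE if the stopwords were removed before the ngrams were formed
    any(stopwordsPL %in% featnames(myStemMat))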