
How to get a list of the types of stopwords removed from a dataset using quanteda in R


I'm working on a text dataset with quanteda in R. I created a corpus from the dataset, then created a dfm with all punctuation and English stopwords removed, using the following:

dfm_nostp <- dfm(data, remove_punct = TRUE, remove=c(stopwords("english")))

Is there a way to inspect how many types of punctuation and stopwords were removed from the dataset in quanteda?

Many Thanks


Solution

  • Try this:

    library("quanteda")
    ## Package version: 1.5.2
    
    summarize_texts_extended <- function(x, stop_words = stopwords("en")) {
      toks <- tokens(x) %>%
        tokens_tolower()
    
      # total tokens
      ndocs <- ndoc(x)
      ntoksall <- ntoken(toks)
      ntoks <- sum(ntoksall)
    
      # punctuation
      toks <- tokens(toks, remove_punct = TRUE, remove_symbols = FALSE)
      npunct <- ntoks - sum(ntoken(toks))
    
      # symbols and emoji
      toks <- tokens(toks, remove_symbols = TRUE)
      nsym <- ntoks - npunct - sum(ntoken(toks))
    
      # numbers
      toks <- tokens(toks, remove_numbers = TRUE)
      nnumbers <- ntoks - npunct - nsym - sum(ntoken(toks))
    
      # words
      nwords <- ntoks - npunct - nsym - nnumbers
    
      # stopwords: counted as distinct types (features), not tokens
      dfmat <- dfm(toks)
      nfeats <- nfeat(dfmat)
      dfmat <- dfm_remove(dfmat, stop_words)
      nstopwords <- nfeats - nfeat(dfmat)
    
      list(
        total_tokens = ntoks,
        total_punctuation = npunct,
        total_symbols = nsym,
        total_numbers = nnumbers,
        total_words = nwords,
        total_stopwords = nstopwords
      )
    }
    

    It returns the quantities you want as a list. The first five are token counts; total_stopwords is the number of distinct stopword types removed:

    summarize_texts_extended(data_corpus_inaugural)
    ## $total_tokens
    ## [1] 149138
    ## 
    ## $total_punctuation
    ## [1] 13852
    ## 
    ## $total_symbols
    ## [1] 4
    ## 
    ## $total_numbers
    ## [1] 85
    ## 
    ## $total_words
    ## [1] 135197
    ## 
    ## $total_stopwords
    ## [1] 136
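
The function above reports how many stopword types were removed, not which ones. If you also want to see the removed types themselves, one possible approach is to compare the type lists before and after removal. The sketch below is only an illustration: `list_removed_types` is a hypothetical helper name (not part of quanteda), and it assumes lowercased tokens and the default English stopword list, as in the function above.

```r
library("quanteda")

# Hypothetical helper (not a quanteda function): returns the distinct
# types that punctuation and stopword removal would drop from `x`.
list_removed_types <- function(x, stop_words = stopwords("en")) {
  # all types, lowercased, before any removal
  toks_all <- tokens_tolower(tokens(x))
  # the same texts, tokenized with punctuation removed
  toks_nopunct <- tokens_tolower(tokens(x, remove_punct = TRUE))

  types_all <- types(toks_all)
  types_nopunct <- types(toks_nopunct)

  list(
    # types present before punctuation removal but absent afterwards
    punctuation = setdiff(types_all, types_nopunct),
    # remaining types that also appear in the stopword list
    stopwords = intersect(types_nopunct, stop_words)
  )
}

removed <- list_removed_types(c(d1 = "The cat sat on the mat, and then it slept!"))
removed$punctuation
removed$stopwords
```

Calling `length()` on each element gives the number of types removed, and the same helper should also accept a corpus such as `data_corpus_inaugural`, since `tokens()` handles both character vectors and corpus objects.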