Tags: r, data-mining, text-mining, tm

Self-conflicting stopwords in R tm text mining


I'm in the process of cleaning up data for text mining. This involves removing numbers, punctuation, and stopwords (common words that would just add noise to the analysis), and later doing word stemming.

Using the tm package in R, you can remove stopwords, for example with tm_map(myCorpus, removeWords, stopwords('english')); the tm manual itself demonstrates stopwords("english"). This word list contains contractions such as "I'd" and "I'll", as well as the very common word "I":

> library(tm)
> which(stopwords('english') == "i")
[1] 1
> which(stopwords('english') == "i'd")
[1] 69

(Text is assumed to be lowercase before removing stopwords.)
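(For context, here is a rough sketch of the whole cleaning pipeline I have in mind. myCorpus is a placeholder corpus, stemDocument needs the SnowballC package installed, and stopwords are removed before punctuation so that contractions like "i'd" are still intact when removeWords runs:)

    library(tm)
    myCorpus <- VCorpus(VectorSource("I'd like a soda, please"))
    myCorpus <- tm_map(myCorpus, content_transformer(tolower))       # lowercase first
    myCorpus <- tm_map(myCorpus, removeNumbers)
    myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))  # before punctuation
    myCorpus <- tm_map(myCorpus, removePunctuation)
    myCorpus <- tm_map(myCorpus, stemDocument)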

But (presumably) because "i" comes first in the list, the contractions are never removed:

> removeWords("i'd like a soda, please", stopwords('english'))
[1] "'d like  soda, please"
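The order sensitivity suggests that removeWords matches each word between regex word boundaries, in list order. An apostrophe counts as a boundary, so \bi\b matches the i inside i'd, and the later pattern for i'd then has nothing left to match. This can be reproduced with a plain gsub (an illustration of the boundary behaviour in base R, not necessarily tm's exact implementation):

> gsub("\\bi\\b", "", "i'd like a soda, please")
[1] "'d like a soda, please"
> gsub("\\bi'd\\b", "", "'d like a soda, please")  # "i'd" no longer matches
[1] "'d like a soda, please"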

A quick hack is to reverse the wordlist:

> removeWords("i'd like a soda, please", rev(stopwords('english')))
[1] " like  soda, please"
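A slightly more explicit variant of the same hack is to sort the list by decreasing length, so that multi-character entries like "i'd" are always tried before the single letters they contain (this assumes, as the reversal does, that words are matched in list order):

> sw <- stopwords('english')
> removeWords("i'd like a soda, please", sw[order(nchar(sw), decreasing = TRUE)])
[1] " like  soda, please"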

Another solution is to find/make a better wordlist.

Is there a better/correct way to use stopwords('english')?


Solution

  • The problem here comes from the underdetermined workflow made possible by the tools you are using. Simply put, removing stopwords means filtering tokens, but the text you are removing the stopwords from has not yet been tokenized.

    Specifically, the i is removed from i'd because the word matching treats the apostrophe as a word boundary. In the text analysis package quanteda, you are required to tokenise the text first and only then remove features based on token matches. For instance:

    require(quanteda)
    removeFeatures(tokenize("i'd like a soda, please"), c("i'd", "a"))
    # tokenizedText object from 1 document.
    # Component 1 :
    # [1] "like"   "soda"   ","      "please"
    

    quanteda also has a built-in list of the most common stopwords, so this works too (and here, we have also removed punctuation):

    removeFeatures(tokenize("i'd like a soda, please", removePunct = TRUE),
                   stopwords("english"))
    # tokenizedText object from 1 document.
    # Component 1 :
    # [1] "like"   "soda"   "please"
    

    In my opinion (biased, admittedly, since I designed quanteda) this is a better way to remove stopwords in English and most other languages.

    Update Jan 2021, for a more modern version of quanteda

    require("quanteda")
    ## Loading required package: quanteda
    ## Package version: 2.1.2
    
    tokens("i'd like a soda, please") %>%
      tokens_remove(c("i'd", "a"))
    ## Tokens consisting of 1 document.
    ## text1 :
    ## [1] "like"   "soda"   ","      "please"
    
    # or using the stopwords list and removing punctuation
    tokens("i'd like a soda, please", remove_punct = TRUE) %>%
      tokens_remove(stopwords("en"))
    ## Tokens consisting of 1 document.
    ## text1 :
    ## [1] "like"   "soda"   "please"
    

    Created on 2021-02-01 by the reprex package (v1.0.0)
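
    Since the question also plans a stemming step, the same token-based workflow extends naturally. As a sketch, quanteda's tokens_wordstem() (a Snowball stemmer, via the SnowballC package) can be chained onto the pipeline above; the Snowball English stemmer truncates "please" to "pleas":

    tokens("i'd like a soda, please", remove_punct = TRUE) %>%
      tokens_remove(stopwords("en")) %>%
      tokens_wordstem()
    ## Tokens consisting of 1 document.
    ## text1 :
    ## [1] "like"  "soda"  "pleas"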