I'm in the process of cleaning up data for text mining. This involves removing numbers, punctuation, and stopwords (common words that would just be noise in the data mining), and later doing word stemming.
Using the tm package in R, you can remove stopwords, for example using tm_map(myCorpus, removeWords, stopwords('english')). The tm manual itself demonstrates using stopwords("english"). This word list contains contractions such as "I'd" and "I'll", as well as the very common word "I":
> library(tm)
> which(stopwords('english') == "i")
[1] 1
> which(stopwords('english') == "i'd")
[1] 69
(Text is assumed to be lowercase before removing stopwords.)
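For context, a minimal sketch of the full cleaning pipeline I have in mind (the sample text and object names are just placeholders):

library(tm)
docs <- Corpus(VectorSource("I'd like 2 sodas, please!"))
docs <- tm_map(docs, content_transformer(tolower))       # lowercase first
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))  # the problematic step
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stemDocument)                       # stemming comes later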
But (presumably) because "i" comes first in the list, the contractions are never removed:
> removeWords("i'd like a soda, please", stopwords('english'))
[1] "'d like soda, please"
A quick hack is to reverse the wordlist:
> removeWords("i'd like a soda, please", rev.default(stopwords('english')))
[1] " like soda, please"
Another solution is to find/make a better wordlist.
Is there a better/correct way to use stopwords('english')?
The problem here comes from the underdetermined workflow made possible by the tools you are using. Simply put, removing stopwords means filtering tokens, but the text you are removing the stopwords from has not yet been tokenised.

Specifically, the i is removed from i'd because the apostrophe is treated as a word boundary, so the stopword i matches on its own inside the contraction. In the text analysis package quanteda, you are required to tokenise the text first, and only then remove features based on token matches. For instance:
require(quanteda)
removeFeatures(tokenize("i'd like a soda, please"), c("i'd", "a"))
# tokenizedText object from 1 document.
# Component 1 :
# [1] "like" "soda" "," "please"
quanteda also has a built-in list of the most common stopwords, so this works too (and here, we have also removed punctuation):
removeFeatures(tokenize("i'd like a soda, please", removePunct = TRUE),
stopwords("english"))
# tokenizedText object from 1 document.
# Component 1 :
# [1] "like" "soda" "please"
In my opinion (biased, admittedly, since I designed quanteda), this is a better way to remove stopwords in English and most other languages. Note that removeFeatures() and tokenize() above are the older quanteda API; the same example in the current (v2+) syntax:
require("quanteda")
## Loading required package: quanteda
## Package version: 2.1.2
tokens("i'd like a soda, please") %>%
tokens_remove(c("i'd", "a"))
## Tokens consisting of 1 document.
## text1 :
## [1] "like" "soda" "," "please"
# or using the stopwords list and removing punctuation
tokens("i'd like a soda, please", remove_punct = TRUE) %>%
  tokens_remove(stopwords("en"))
## Tokens consisting of 1 document.
## text1 :
## [1] "like" "soda" "please"
Created on 2021-02-01 by the reprex package (v1.0.0)
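Since the question also mentions stemming as a later step, the same token pipeline extends naturally with quanteda's tokens_wordstem() (a Snowball/Porter stemmer); a sketch:

tokens("i'd like some sodas, please", remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  tokens_wordstem()
## Tokens consisting of 1 document.
## text1 :
## [1] "like"  "soda"  "pleas"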