r, tm

Removing Punctuation, Numbers, and Whitespace not working


I am trying to remove punctuation, numbers, and whitespace from a corpus.

My code is:

# Create a corpus
bd_corpus =  Corpus(VectorSource(bd_text))

# Clean the corpus by removing puncuation, numbers, and white spaces
bd_clean <- tm_map(bd_corpus,removePunctuation)
bd_clean <- tm_map(bd_corpus,removeNumbers)
bd_clean <- tm_map(bd_corpus,removeStripwhitespace)

wordcloud(bd_clean)

#modify your word cloud
wordcloud(bd_clean, random.order = F, max.words = 25, scale = c(7, 0.5))

It outputs a word cloud, but the words still contain punctuation (colons, backslashes, periods, etc.), such as "here," and "hey," and "people."

Additionally here is the console output:

# Clean the corpus by removing puncuation, numbers, and white spaces
> bd_clean <- tm_map(bd_corpus,removePunctuation)

Warning message:
In tm_map.SimpleCorpus(bd_corpus, removePunctuation) :
  transformation drops documents
> bd_clean <- tm_map(bd_corpus,removeNumbers)

Warning message:
In tm_map.SimpleCorpus(bd_corpus, removeNumbers) :
  transformation drops documents
> bd_clean <- tm_map(bd_corpus,removeStripwhitespace)

Error in tm_map.SimpleCorpus(bd_corpus, removeStripwhitespace) : 
  object 'removeStripwhitespace' not found

Solution

  • From @Gregor in comments above:

    Let's say I have x <- 1. Then I run these commands: y <- x + 1, y <- x + 2, y <- x + 3. What is y at the end? 4 is the right answer, because when we run y <- x + 3, it doesn't matter what y was before. You're doing the same thing: bd_clean <- tm_map(bd_corpus, removePunctuation) removes the punctuation from bd_corpus. Your next line, bd_clean <- tm_map(bd_corpus, removeNumbers), removes numbers from the original bd_corpus and overwrites the version without punctuation. Instead, you need to have bd_clean <- tm_map(bd_clean, removeNumbers), to build on what you've already done.
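
Putting that advice together, a minimal sketch of the corrected cleaning steps might look like the following. The sample bd_text vector is a made-up stand-in for the asker's data, which is not shown in the question. The last step uses stripWhitespace, the whitespace-collapsing transformation that tm exports; removeStripwhitespace is not a tm function, which is why the console reports "object 'removeStripwhitespace' not found".

```r
library(tm)

# Hypothetical sample text standing in for the asker's bd_text (an assumption)
bd_text <- c("Hey, people: look here!",
             "There are 3 cats   and 42 dogs.")

# Create a corpus, as in the question
bd_corpus <- Corpus(VectorSource(bd_text))

# The first step starts from the raw corpus ...
bd_clean <- tm_map(bd_corpus, removePunctuation)
# ... and every later step starts from the previous result, bd_clean,
# so earlier cleaning is kept instead of being overwritten
bd_clean <- tm_map(bd_clean, removeNumbers)
bd_clean <- tm_map(bd_clean, stripWhitespace)

# Pull the cleaned text back out as a plain character vector to inspect it
cleaned <- unname(sapply(bd_clean, as.character))
print(cleaned)
```

The cleaned bd_clean corpus can then be passed to wordcloud() exactly as in the question. Depending on the tm version, the "transformation drops documents" warning may still be printed for a corpus built this way; the sketch above does not attempt to address it.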