
Quotes and hyphens not removed by tm package functions while cleaning corpus


I'm trying to clean the corpus and I've used the typical steps, like the code below:

library(tm)

docs <- Corpus(DirSource(path))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords('en'))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm <- DocumentTermMatrix(docs)

Yet when I inspect the matrix, a few terms still carry quotes or leading hyphens, such as: "we" "company" "code guidelines" -known -accelerated

The quotes appear to be part of the terms themselves, but running removePunctuation again does not remove them. Some terms also have bullets in front of them that I can't remove either.

Any help would be greatly appreciated.


Solution

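  • removePunctuation uses gsub('[[:punct:]]','',x), i.e. it removes the ASCII punctuation characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. Typographic quotes, bullet signs and similar symbols are not reliably covered by that class, and the leading "hyphens" in your output are probably en or em dashes rather than ASCII hyphens. To remove such symbols (or any others), declare your own transformation function: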

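    # Square brackets make this a character class: any one of “, ” or • is removed
    removeSpecialChars <- function(x) gsub("[“”•]","",x)
    # content_transformer keeps the result a valid document in recent tm versions
    docs <- tm_map(docs, content_transformer(removeSpecialChars))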
    

    Or you can go further and remove everything that is not an ASCII letter, a digit or a space (note that this also strips accented letters):

    removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
    docs <- tm_map(docs, content_transformer(removeSpecialChars))
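
To check a pattern before running it over the whole corpus, it can help to try it on a few strings in base R first. The sketch below is only an illustration. The sample terms and both function names (remove_special_chars, remove_unicode_punct) are made up for this example. The second option, the Unicode property \p{P} with perl = TRUE, is an alternative that was not part of the answer above, and it assumes R's PCRE was built with Unicode property support.

```r
# Hypothetical sample terms, written with \u escapes so the file encoding does not matter:
# curly double quotes, a bullet, an en dash and a curly apostrophe.
terms <- c("\u201cwe\u201d", "\u2022 company", "\u2013known", "it\u2019s")

# Option 1: list the unwanted characters explicitly inside a character class
# (curly double and single quotes, bullet, en dash, em dash).
remove_special_chars <- function(x) {
  gsub("[\u201c\u201d\u2018\u2019\u2022\u2013\u2014]", "", x)
}

# Option 2 (assumption: PCRE with Unicode properties is available):
# strip every character in the Unicode punctuation category.
remove_unicode_punct <- function(x) {
  gsub("\\p{P}", "", x, perl = TRUE)
}

cleaned_explicit <- remove_special_chars(terms)
cleaned_unicode <- remove_unicode_punct(terms)

print(cleaned_explicit)
print(cleaned_unicode)
```

Either function can then be passed to tm_map through content_transformer, in the same way as removeSpecialChars above, and should run before DocumentTermMatrix is built. The leading space left in front of "company" is something stripWhitespace or the tokenizer would normally take care of.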