r, text-mining, tm

Text Mining - removePunctuation not removing quotes and dashes


I have been doing some text mining. I created the document-term matrix (DTM) using the following steps.

corpus1 <- VCorpus(VectorSource(resume1$Dat1))

corpus1 <- tm_map(corpus1, content_transformer(tolower))
corpus1 <- tm_map(corpus1, content_transformer(trimWhiteSpace))

dtm <- DocumentTermMatrix(corpus1,
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         removeSparseTerms = TRUE,
                                         stopwords = TRUE))

After running all of this, I still get terms like -quotation, "fun, model", etc. in the DTM. I also get blank terms like " " in the data.

What can I do about it? I do not need these dashes and extra quotation marks.


Solution

  • I'm not sure why the DocumentTermMatrix control options aren't working for you, but you could try using tm_map to preprocess the corpus before converting it into a DTM. This works for me (note that I reorder the default stop list, because otherwise the stems of words containing apostrophes are removed before the full words, leaving a stranded 's'):

    library(tm)
    
    corpus1 <- VCorpus(VectorSource(resume1$Dat1))
    
    # Put stopwords containing an apostrophe first so they are removed
    # before their apostrophe-free stems.
    sw <- stopwords('english')
    reorder.stoplist <- c(grep("[']", sw, value = TRUE),
                          grep("[']", sw, value = TRUE, invert = TRUE))
    
    corpus1 <- tm_map(corpus1, content_transformer(tolower))
    corpus1 <- tm_map(corpus1, removeWords, reorder.stoplist)
    corpus1 <- tm_map(corpus1, removePunctuation)
    corpus1 <- tm_map(corpus1, removeNumbers)
    corpus1 <- tm_map(corpus1, stripWhitespace)
    
    # Store the result under a new name instead of overwriting the corpus.
    dtm <- DocumentTermMatrix(corpus1)
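
Two points above can be checked in base R without loading tm. The sketch below is illustrative only. `remove_in_order` and `strip_unicode_punct` are hypothetical helpers, not tm functions. The first assumes that word removal behaves like a single regex alternation tried left to right, which is a simplification and not a statement about tm's internals. The second assumes the leftover quotes and dashes are typographic characters (curly quotes, en dashes) rather than ASCII, which is one plausible reason a default punctuation class might leave them in place.

```r
# Hypothetical helper: remove words via one regex alternation, tried in the
# order given. This imitates the ordering effect described in the answer.
remove_in_order <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Stem listed before the full word: only "it" is matched and removed.
stem_first <- remove_in_order("it's fine", c("it", "it's"))
# Full word listed first: the whole "it's" is matched and removed.
full_first <- remove_in_order("it's fine", c("it's", "it"))

# Hypothetical helper: replace any run of Unicode punctuation (general
# category P) with a space, then squeeze and trim whitespace.
# Note: category P does not cover symbols such as "$" or "+".
strip_unicode_punct <- function(x) {
  x <- gsub("\\p{P}+", " ", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

# Curly quotes and an en dash, written as \u escapes to keep the source ASCII.
sample_text <- "\u201cfun, model\u201d \u2013 quotation"
cleaned <- strip_unicode_punct(sample_text)

print(stem_first)
print(full_first)
print(cleaned)
```

If the second helper does what you need, it could be applied to the corpus with `tm_map(corpus1, content_transformer(strip_unicode_punct))` before building the DTM. Newer tm releases also document a `ucp` argument for removePunctuation that targets Unicode punctuation; check `?removePunctuation` in your installed version before relying on it.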