My friend and I are working on transforming some tweets we collected into a document-term matrix (DTM) so that we can run a sentiment analysis using machine learning in R. The task must be performed in R because it is for an exam at our university, where R is the required tool.
We initially collected a smaller sample to test whether our code works before moving on to a larger dataset. Our problem is that we can't figure out how to remove custom words from the DTM. Our code so far looks something like this (we are primarily using the tm package):
library(tm)

file <- read.csv("Tmix.csv",
                 row.names = NULL, sep = ";", header = TRUE)  # just for loading the dataset

tweetsCorpus <- Corpus(VectorSource(file[, 1]))

tweetsDTM <- DocumentTermMatrix(tweetsCorpus,
                                control = list(verbose = TRUE,
                                               asPlain = TRUE,
                                               stopwords = TRUE,
                                               tolower = TRUE,
                                               removeNumbers = TRUE,
                                               stemWords = FALSE,
                                               removePunctuation = TRUE,
                                               removeSeparators = TRUE,
                                               removeTwitter = TRUE,
                                               stem = TRUE,
                                               stripWhitespace = TRUE,
                                               removeWords = c("customword1", "customword2", "customword3")))
We've also tried removing the words before converting to a DTM, using the removeWords command together with all of the "removeXXX" commands in the tm package, and then converting the result to a DTM, but it doesn't seem to work.
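Roughly, that attempt looked something like this (just a sketch; the custom words below are placeholders for our real list):

tweetsCorpus <- tm_map(tweetsCorpus, content_transformer(tolower))
tweetsCorpus <- tm_map(tweetsCorpus, removePunctuation)
tweetsCorpus <- tm_map(tweetsCorpus, removeNumbers)
tweetsCorpus <- tm_map(tweetsCorpus, removeWords,
                       c("customword1", "customword2", "customword3"))
tweetsDTM <- DocumentTermMatrix(tweetsCorpus)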
It is important that we don't simply remove all words with, for instance, 5 or fewer occurrences. We need to keep all terms except the ones we explicitly want to remove, such as https-addresses and the like.
Does anyone know how to do this?
And a second question: is there an easier way to remove all words that start with "https", instead of having to write all of the addresses individually into the code? Right now, for instance, we are writing "httpstcokozcejeg", "httpstcolskjnyjyn", "httpstcolwwsxuem" as single custom words to remove from the data.
NOTE: We know that removeWords is a terrible solution to our problem, but we can't figure out how else to do it.
You can use regular expressions, for example:
gsub("http[a-z]*","","httpstcolwwsxuem here")
[1] " here"
Assuming that you removed punctuation/digits in tweetsCorpus, you can use the following:
1- Direct gsub
# this works on the raw text of one document at a time (here the first one);
# note that the result is a plain character string, not a Corpus object
gsub("http[a-z]*", "", tweetsCorpus[[1]][[1]])
OR
2- tm::tm_map, content_transformer
library(tm)
RemoveURL <- function(x) {
  gsub("http[a-z]*", "", x)
}
tweetsCorpus <- tm_map(tweetsCorpus, content_transformer(RemoveURL))
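The custom words can then be dropped the same way with removeWords before the DTM is built, for example (the word list below is just a placeholder):

tweetsCorpus <- tm_map(tweetsCorpus, removeWords,
                       c("customword1", "customword2", "customword3"))
tweetsDTM <- DocumentTermMatrix(tweetsCorpus)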