I am using this function in a script that uses the R text mining package (tm) to remove URLs from tweets. To my surprise, after cleanup there are some leftover "http" words and also fragments of the URLs themselves (such as t.co). It looks like some of the URLs are wiped out completely, while others are merely broken into their components. What could be the cause? NOTE: I took the . out of the t.co URLs below, because Stack Overflow does not allow submitting URLs to t.co addresses.
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
removeURL <- function(x) gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)
Text before cleaning:
VOTE TODAY! Go to https://tco/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://tco/KPQ5EY9VwQ
Text after cleaning:
vote today go https tco mxraxyntjy find polling location going make america great https tco kpqeyvwq
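You are removing symbols that your removeURL function is looking for: the toSpace calls replace "/" with a space before removeURL runs, so the "://" in its pattern can no longer match. Also, you need to make sure to create proper transformer functions with content_transformer(). Here's a working example that removes the URLs before the other clean-up steps and uses a different regular expression for them (it stops at the first whitespace character):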
library(tm)
test <- "VOTE TODAY! Go to https://t.com/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://t.com/KPQ5EY9VwQ"
trumpcorpus1020to1109 <- VCorpus(VectorSource(test))
# Remove URLs first, while "://" is still intact; \\S+ stops at the first whitespace
removeURL <- content_transformer(function(x) gsub("(f|ht)tp(s?)://\\S+", "", x, perl = TRUE))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
content(trumpcorpus1020to1109[[1]])
# [1] "VOTE TODAY! Go to to find your polling location. We are going to Make America Great Again!… "
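Both failure modes can be reproduced without tm. The following is a minimal base R sketch, not the asker's actual pipeline: the sample tweet and its example.com URLs are made up for illustration, and the variable names are hypothetical. It shows that stripping "/" first leaves nothing for the URL pattern to match, and that the greedy `(.*)` in the question's pattern can swallow the text between two URLs.

```r
# Hypothetical sample tweet; example.com stands in for the real short links.
tweet <- "Go to https://example.com/abc to vote. More at https://example.com/xyz"

greedy_pattern <- "(f|ht)tp(s?)://(.*)[.][a-z]+"  # pattern from the question
space_pattern  <- "(f|ht)tp(s?)://\\S+"           # pattern from the answer

# 1. Slashes replaced first: "://" is gone, so the URL pattern cannot match
#    and the URL pieces stay in the text.
slashes_first <- gsub("/", " ", tweet)
broken <- gsub(space_pattern, "", slashes_first, perl = TRUE)

# 2. Greedy pattern on the untouched tweet: (.*) runs from the first "://"
#    up to the last "." followed by lowercase letters, deleting the words
#    in between and leaving the tail of the second URL behind.
greedy <- gsub(greedy_pattern, "", tweet, perl = TRUE)

# 3. URLs removed first with the whitespace-bounded pattern, then slashes.
clean <- gsub("/", " ", gsub(space_pattern, "", tweet, perl = TRUE))

print(broken)
print(greedy)
print(clean)
```

Only the third variant removes both URLs while keeping the surrounding words, which is why the order of the tm_map calls matters as much as the pattern itself.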