gsub function in tm package to remove URLs does not remove the entire string


I am using the function below in a script that uses the R text mining package (tm) to eliminate URLs from tweets. To my surprise, after cleanup there are some leftover "http" words and also fragments of the URL itself (such as t.co). It looks like some of the URLs are completely wiped out, while others are merely broken down into components. What could be the cause? NOTE: I took the . out of the t.co URLs in the sample text, because StackOverflow does not allow submitting URLs to t.co addresses.

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
removeURL <- function(x) gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)

text before cleaning

VOTE TODAY! Go to https://tco/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://tco/KPQ5EY9VwQ

text after cleaning

vote today go https tco mxraxyntjy find polling location going make america great https tco kpqeyvwq


Solution

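  • You are removing symbols that your removeURL function is looking for: the toSpace step replaces "/" with a space before removeURL runs, so by then the "://" that the pattern requires is no longer in the text. Also, you need to make sure to create proper transformer functions with content_transformer(); in the question, removeURL is passed to tm_map() as a plain function. Here's a working example that removes the URLs first, with a different regular expression for matching them (it stops at a space):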

    library(tm)
    test <- "VOTE TODAY! Go to https://t.com/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://t.com/KPQ5EY9VwQ"

    trumpcorpus1020to1109 <- VCorpus(VectorSource(test))

    # Remove URLs first, while "://" and "/" are still intact; \\S+ stops at the next whitespace
    removeURL <- content_transformer(function(x) gsub("(f|ht)tp(s?)://\\S+", "", x, perl = TRUE))
    trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)

    # Only then replace the remaining symbols with spaces
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
    trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
    trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
    content(trumpcorpus1020to1109[[1]])
    # [1] "VOTE TODAY! Go to  to find your polling location. We are going to Make America Great Again!… "
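
To see why the order and the pattern both matter, here is a minimal base R sketch that needs no tm at all. The tweet text, the example.com URLs, and the variable names are made up for illustration; only the two regular expressions come from the question and the answer above.

```r
# Hypothetical sample text with two URLs (not taken from the original corpus).
tweet <- "Go to https://example.com/abc123 to vote. More at https://example.com/xyz789"

answer_pattern   <- "(f|ht)tp(s?)://\\S+"           # from the answer: stops at whitespace
question_pattern <- "(f|ht)tp(s?)://(.*)[.][a-z]+"  # from the question: greedy .*

# Order used in the question: "/" is replaced first, so "://" no longer
# exists when the URL pattern runs and nothing is removed.
slashes_first <- gsub(answer_pattern, "", gsub("/", " ", tweet), perl = TRUE)

# Order used in the answer: URLs are removed while they are still intact.
urls_first <- gsub("/", " ", gsub(answer_pattern, "", tweet, perl = TRUE))

# The question's pattern on the intact text: the greedy .* runs from the first
# URL to the last ".letters" in the string, removing the words in between.
greedy <- gsub(question_pattern, "", tweet, perl = TRUE)

print(slashes_first)  # the "https:" fragments are still there
print(urls_first)     # "Go to  to vote. More at "
print(greedy)         # "Go to /xyz789"
```

The first result shows the leftover "https" and URL fragments described in the question. The last one suggests that even with the steps in the right order, the greedy pattern can delete ordinary text between two URLs, which is a reason to prefer a pattern that stops at whitespace.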