
Remove spaces and numbers in text analysis


I'm extracting tweets with R, but when I analyse the output, a lot of blank entries and numbers get counted. How can I remove them?

I'm using the following code:

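library(twitteR)  # searchTwitter(), twListToDF()
library(tm)       # removeWords(), stopwords(), removePunctuation()

tweets <- searchTwitter('weather', n=10,lang='en')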
t <- twListToDF(tweets)
tw.text <- t[,"text"]
tw.text <- tolower(tw.text)
tw.text <- removeWords(tw.text,c(stopwords('en'),'rt'))
tw.text <- removePunctuation(tw.text,TRUE)
tw.text <- unlist(strsplit(tw.text,' '))
word <- sort(table(tw.text),TRUE)
wordc <- head(word,n=10)

When I run wordc I get the following:

> wordc
tw.text
                       RT      weather       County          EST       Severe Thunderstorm      Warning           25        430PM 
          31            4            4            3            3            3            3            3            2            2 

As you can see, I get 31 blank entries, 2 entries with the number 25, and 2 entries with 430PM. How can I remove these types of entries?


Solution

  • After tw.text <- unlist(strsplit(tw.text,' ')), you have a character vector of tokens, some of which are empty or contain only spaces. You can use sub to strip the spaces from each element and which to keep only the elements that are not empty afterwards. Here's an example:

    foo <- c("hi"," ","     ","test")
    bar <- foo[which(sub(" +","",foo)!="")]
    length(bar)
    [1] 2
    print(bar)
    [1] "hi"   "test"
    

    Of course, if you want all the spaces removed from each entry, you can store the stripped values instead of using them only as a filter. Use gsub for that (i.e. gsub(" +","",foo) gives you a vector with no spaces), since sub only replaces the first run of spaces in each element.
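
The example above only deals with the blank entries. To also drop the number-like entries the question mentions (25, 430PM), one possible approach is to filter the tokens with grepl. This is a minimal sketch in base R; the tokens vector is made-up sample data standing in for the result of the strsplit step, and dropping every token that contains a digit is an assumption about what you want to keep.

```r
# Hypothetical sample standing in for unlist(strsplit(tw.text, ' '))
tokens <- c("weather", "", "25", "430pm", " ", "severe", "warning", "")

# Keep only tokens that contain at least one non-whitespace character
tokens <- tokens[grepl("[^[:space:]]", tokens)]

# Drop any token that contains a digit (e.g. "25", "430pm")
words <- tokens[!grepl("[0-9]", tokens)]

print(words)
# -> "weather" "severe" "warning"
```

After this, sort(table(words), TRUE) should count only the remaining words. The tm package also provides removeNumbers() and stripWhitespace(), which could be applied before splitting; note that removeNumbers() strips the digits inside a token, so "430pm" would become "pm" rather than being dropped.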