I'm extracting tweets from Twitter with R, but when I analyse the output, a lot of blanks and numbers get counted. How can I remove these?
I'm using the following code:
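library(twitteR)  # searchTwitter, twListToDF
library(tm)       # removeWords, stopwords, removePunctuation
tweets <- searchTwitter('weather', n=10,lang='en')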
t <- twListToDF(tweets)
tw.text <- t[,"text"]
tw.text <- tolower(tw.text)
tw.text <- removeWords(tw.text,c(stopwords('en'),'rt'))
tw.text <- removePunctuation(tw.text,TRUE)
tw.text <- unlist(strsplit(tw.text,' '))
word <- sort(table(tw.text),TRUE)
wordc <- head(word,n=10)
When I run wordc I get the following:
> wordc
tw.text
   RT weather County EST Severe Thunderstorm Warning 25 430PM
31  4       4      3   3      3            3       3  2     2
As you can see, I get 31 blank entries, 2 entries for the number 25, and 2 entries for 430PM. How can I remove these types of entries?
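After tw.text <- unlist(strsplit(tw.text,' ')), you have a vector of text elements. The blanks appear because splitting on a single space returns an empty string for every extra space in a run of consecutive spaces. You can use sub and which to keep only the values that aren't blank. Here's an example: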
foo <- c("hi"," "," ","test")
bar <- foo[which(sub(" +","",foo)!="")]
length(bar)
[1] 2
print(bar)
[1] "hi" "test"
Of course, if you want the spaces removed from each entry rather than dropping the blank entries, you can store the stripped values instead. Note that sub only replaces the first match in each element, so use gsub(" +","",foo) if you want a vector with no spaces at all.
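The example above deals with the blanks, but the question also asks about entries such as 25 and 430PM. Here is a minimal sketch of one way to drop both in base R. It assumes that any token containing a digit should be discarded, and it uses a made-up sample vector in place of the real tweet text:

```r
# Hypothetical sample standing in for tw.text after tolower(),
# removeWords() and removePunctuation(); not real tweet data
tw.text <- c("severe  thunderstorm warning 25", " weather  430pm county ")

# Split on runs of whitespace, so consecutive spaces don't produce empty strings
tokens <- unlist(strsplit(tw.text, "[[:space:]]+"))

# A leading space still yields one empty string, so drop any that remain
tokens <- tokens[nzchar(tokens)]

# Assumption: any token containing a digit (e.g. "25", "430pm") is unwanted
tokens <- tokens[!grepl("[[:digit:]]", tokens)]

# Count the remaining words as in the question
word <- sort(table(tokens), decreasing = TRUE)
print(head(word, n = 10))
```

If you only want to drop tokens made up entirely of digits and keep mixed ones like 430pm, the pattern "^[[:digit:]]+$" could be used instead.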