I'm trying to figure out how I could identify documents (tweets in this case) based on a term they may include.
Say I have this data frame (df), which holds the screen names of some Twitter users and one tweet from each of them.
> df
ScreenName tweet
[1,] "Guy A" "one random tweet"
[2,] "Guy B" "another random tweet"
[3,] "Guy C" "a third random piece of text"
Within this data frame I would like to find the tweets that include a certain term (say "tweet") and extract those into a new data frame (df2), like so:
> df2
ScreenName tweet
[1,] "Guy A" "one random tweet"
[2,] "Guy B" "another random tweet"
I assume there must be a way to do it using the tm or qdap packages, but I could not find anything and so ended up with this mess:
After cleaning the corpus I convert it to a TermDocumentMatrix
tdm <- TermDocumentMatrix(corpus, control=list(minWordLength=1))
I then pull out the row of the term-document matrix for the term I am interested in
t <- as.vector(tdm[term,])
Subset: keep the documents in which the term is mentioned at least once
t.df <- as.data.frame(t)
t.sub <- subset(t.df, t >= 1)
Get document number (row number)
t.n <- as.numeric(rownames(t.sub))
Create new data frames: t.tw holds only the tweets mentioning the term, and t.o holds the other tweets
t.tw <- tw[t.n,]
t.o <- tw[!1:nrow(tw) %in% t.n, ]
Thanks for your help.
Apologies if the horrendous piece of code above has offended any accomplished R users.
I'd stay in base R for this and use the grep function (if you already have a data.frame) with the following line:
df[grep("tweet", df$tweet), ]
Here it is in full with your data:
df <- read.table(text='ScreenName tweet
"Guy A" "one random tweet"
"Guy B" "another random tweet"
"Guy C" "a third random piece of text"', header=TRUE)
df[grep("tweet", df$tweet), ]
## ScreenName tweet
## 1 Guy A one random tweet
## 2 Guy B another random tweet
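If you also want the tweets that do not mention the term (your t.o), a logical index from grepl may be more convenient than row numbers. This is only a minimal sketch on the same toy data; the names term, has_term and other are mine, chosen for illustration, and fixed = TRUE is an assumption that you want a literal string match rather than a regular expression.

```r
# Same toy data as above, built directly as a data frame.
df <- data.frame(
  ScreenName = c("Guy A", "Guy B", "Guy C"),
  tweet = c("one random tweet", "another random tweet",
            "a third random piece of text"),
  stringsAsFactors = FALSE
)

term <- "tweet"  # hypothetical search term

# grepl() returns one TRUE/FALSE per row; fixed = TRUE treats the term
# as a literal string instead of a regular expression.
has_term <- grepl(term, df$tweet, fixed = TRUE)

df2   <- df[has_term, ]   # tweets that mention the term
other <- df[!has_term, ]  # all remaining tweets
```

The logical index is safer than df[-grep(...), ] for the complement, because a negative index built from an empty grep result selects no rows at all instead of every row. For a case-insensitive match, drop fixed = TRUE and pass ignore.case = TRUE instead.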