I'm having trouble summing approximate matches of text strings, as well as pulling information from the string that was matched first in time.
I have data that look like this:
text<-c("THEN it goes West","AT it falls East","it goes West", "it falls East", "AT it goes West")
date<-c(2008,2009,2003,2006,2011)
ID<-c(1,2,3,4,5)
data<-cbind(text,date,ID)
data<-as.data.frame(data)
Notice that the latest text strings have all-caps "THEN" and "AT" added to the earlier text strings.
I would like a table that looks like this:
ID Sum Originaltext Originaldate
[1,] "4" "3" "it goes West" "2003"
[2,] "2" "2" "it falls East" "2006"
This includes:
The ID number corresponding with the text with the earliest date (the "original" text that the others were derived from). Sums of all approximate matches for each. The text corresponding with the earliest date. And the date of the text corresponding with the earliest date.
I have tens of millions of cases, so I'm having trouble automating the process.
I run Windows 7, and have access to fast-computing servers.
IDEAS
#order them backwards in time
data<-data[order(data$date, decreasing = TRUE),]
#find the strings with the latest date
pattern<-"AT|THEN"
k <- vector("list", length(data$text))
for (j in 1:length(data$text)){
k[[j]]<- grep(pattern,data$text[[j]], ignore.case=FALSE)
}
k<-subset(data$text, k==1)
k<-unique(k)
#this is a problem, because case nos. 1 and 5 are still in the dataset, but they derive from the same tweet.
From here, I can use "agrep", but I'm not sure in what context. Any help would be greatly appreciated!
NOTE: While the three answers below do answer my question the way I originally asked it, I have not mentioned that my text cases do vary even without the words "AT" and "THEN". In fact, most of them do not match exactly. I should have put this in the original question. However, I would still love an answer.
Thanks!
A data.table
solution avoiding stringr
. I am sure this could be improved
# make the factor columns character
.data <- lapply(data, function(x) if(is.factor(x)) {as.character(x)} else { x})
library(data.table)
DT <- as.data.table(.data)
DT[, original_text := text]
# using `%like% which is an easy data.table wrapper for grepl
DT[text %like% "^THEN", text := substr(text, 6, nchar(text))]
DT[text %like% "^AT", text := substr(text, 4, nchar(text))]
# or avoiding the two vector scans and replacing in one fell swoop
DT[,text := gsub('(^THEN )|(^AT )', '', text)]
DT[, c(sum=.N, .SD[which.min(date)]) ,by=text]
# assuming that text is a factor
DTF <- as.data.table(data)
DTF[, original_text := text]
levels_text <- DTF[, levels(text)]
new_levels <- gsub('(^THEN )|(^AT )', x= levels_text ,'')
# reset the levels
setattr(DTF[['text']], 'levels', new_levels)
# coerce to character and do the same count / min date
DTF[, c(sum=.N, .SD[which.min(date)]) ,by=list(text = as.character(text))]