Using the tm package, I have a corpus of 10,900 documents (docs).
docs = Corpus(VectorSource(abstracts$abstract))
I also have a list of terms (termslist) with all their synonyms and different spellings. I use it to replace each synonym or spelling with a single canonical term.
Term, Synonyms
term1, synonym1
term1, synonym2
term1, synonym3
term2, synonym1
... etc
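For concreteness, a table like this can be read into a data frame along these lines (the file name here is just illustrative):

termslist <- read.csv('termslist.csv', stringsAsFactors = FALSE)  # columns: Term, Synonyms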
The way I'm doing it right now is to loop through all the documents, with a nested inner loop over all the terms to find and replace:
for (s in 1:length(docs)) {
  for (i in 1:nrow(termslist)) {
    docs[[s]]$content <- gsub(termslist[i, 2], termslist[i, 1], docs[[s]]$content)
  }
  print(s)
}
Currently this takes about a second per document (with around 1,000 rows in termslist), which means 10,900 seconds, or roughly three hours!

Is there a more optimized way of doing this within the tm package, or within R generally?
UPDATE:
mathematical.coffee's answer was really helpful. I had to re-create the table with unique terms as rows and a second column holding their synonyms separated by '|', and then loop over that as before. Now it takes significantly less time.
[The messy] code for creating the new table:
newtermslist <- list()
authname <- unique(termslist[, 1])
newtermslist <- cbind(newtermslist, authname)

# build one '(synonym1|synonym2|...)' pattern per unique term
syns <- list()
for (i in seq(authname)) {
  syns <- rbind(syns,
                paste0('(',
                       paste(termslist[which(termslist[, 1] == authname[i]), 2], collapse = '|'),
                       ')'))
}
newtermslist <- cbind(newtermslist, syns)
newtermslist <- cbind(unlist(newtermslist[, 1]), unlist(newtermslist[, 2]))
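(The replacement loop itself then has the same shape as before, only iterating over the collapsed table - a sketch, assuming newtermslist as built above:)

for (s in 1:length(docs)) {
  for (i in 1:nrow(newtermslist)) {
    docs[[s]]$content <- gsub(newtermslist[i, 2], newtermslist[i, 1], docs[[s]]$content)
  }
}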
ANSWER (by mathematical.coffee):

I think when you wish to perform many replacements, this may be the only way to do it (i.e. sequentially, saving the replaced output as the input for the next replacement).
However, you might gain some speed trying (you will have to do some benchmarking to compare):

- fixed=T (since your synonyms are not regexes but literal spellings) and useBytes=T (see ?gsub - if you have a multibyte locale this may or may not be a good idea). Or
- collapsing all the synonyms for one term into a single regex: if blue has synonyms cerulean, cobalt and sky, then your regex could be (cerulean|cobalt|sky) with replacement blue, so that all the synonyms for blue are replaced in one iteration rather than in 3 separate ones. To do this, you'd preprocess your termslist - e.g. newtermslist <- ddply(terms, .(term), summarize, regex=paste0('(', paste(synonym, collapse='|'), ')')) - and then do your current loop over this. You will have fixed=F
(the default, i.e. use regex); see the sketch after this list.
- also see ?tm_map and ?content_transformer. I'm not sure if these will speed things up at all, but you could try.
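A rough sketch of how the last two suggestions could fit together (assuming plyr is installed, termslist is a data frame with columns term and synonym, and replaceSynonyms is just an illustrative name):

library(plyr)
library(tm)

# one row per term, with all its synonyms collapsed into a single alternation regex
newtermslist <- ddply(termslist, .(term), summarize,
                      regex = paste0('(', paste(synonym, collapse = '|'), ')'))

# wrap the replacement loop in a content_transformer so tm_map applies it corpus-wide
replaceSynonyms <- content_transformer(function(x) {
  for (i in seq_len(nrow(newtermslist))) {
    x <- gsub(newtermslist$regex[i], newtermslist$term[i], x)  # fixed=F: patterns are regexes
  }
  x
})
docs <- tm_map(docs, replaceSynonyms)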
(Re benchmarking - try library(rbenchmark); benchmark(expression1, expression2, ...), or good ol' system.time for timing, and Rprof for profiling.)
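For the timing itself, a minimal runnable sketch (the sample vector txt and both gsub calls are made up purely for illustration):

library(rbenchmark)

txt <- rep('the cerulean sky over the cobalt sea', 1000)

benchmark(
  literal   = gsub('cerulean', 'blue', txt, fixed = TRUE),
  collapsed = gsub('(cerulean|cobalt|sky)', 'blue', txt),
  replications = 100
)

# or good ol' system.time:
system.time(gsub('(cerulean|cobalt|sky)', 'blue', txt))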