How to add metadata to tm Corpus object with tm_map


I have been reading different questions/answers (especially here and here) without managing to apply any to my situation.

I have an 11,390-row matrix with columns id, author, and text, such as:

library(tm)

m <- cbind(c("01","02","03","04","05","06"),
           c("Author1","Author2","Author2","Author3","Author3","Author4"),
           c("Text1","Text2","Text3","Text4","Text5","Text6"))

I want to create a tm corpus out of it. I can quickly create my corpus with

tm_corpus <- Corpus(VectorSource(m[,3]))

which, for my 11,390-row matrix, finishes in

   user  system elapsed 
  2.383   0.175   2.557 

But then when I try to add metadata to the corpus with

meta(tm_corpus, type="local", tag="Author") <- m[,2]

execution ran for more than 15 minutes without finishing, at which point I stopped it.

According to the discussion here, processing the corpus with tm_map is likely to reduce the time significantly; something like

tm_corpus <- tm_map(tm_corpus, addMeta, m[,2])

Still, I am not sure how to do this. It would probably be something like

addMeta <- function(text, vector) {
  meta(text, tag="Author") = vector[??]
  text
}

For one thing, how do I pass tm_map a vector of values so that each one is assigned to the corresponding text in the corpus? Should I call the function from within a loop? Should I wrap the tm_map call in vapply?


Solution

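  • Yes, tm_map is faster and it is the way to go. Use it here with a global counter that tracks which document is currently being processed.

    auths <- m[, 2]   # one author per document, in corpus order
    i <- 0            # counter defined outside the mapped function
    tm_corpus <- tm_map(tm_corpus, function(x) {
       i <<- i + 1                      # advance the outer counter on each call
       meta(x, "Author") <- auths[i]    # attach the i-th author to this document
       x
    })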
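
The counter approach depends on the mapped function being called once per document, in order. As a possible alternative that avoids per-document assignment altogether, recent versions of tm let you build the corpus from a data frame with DataframeSource, which treats any columns beyond doc_id and text as document-level metadata. The sketch below is a swapped-in technique rather than the tm_map solution above. It assumes a tm version whose DataframeSource expects the doc_id and text columns, and the column name author is chosen here purely for illustration. The metadata ends up in the corpus-wide metadata table rather than inside each document.

```r
library(tm)

# Small stand-in for the 11,390-row matrix from the question
m <- cbind(c("01", "02", "03", "04", "05", "06"),
           c("Author1", "Author2", "Author2", "Author3", "Author3", "Author4"),
           c("Text1", "Text2", "Text3", "Text4", "Text5", "Text6"))

# DataframeSource expects 'doc_id' first and 'text' second;
# any further columns (here 'author') are carried along as metadata.
df <- data.frame(doc_id = m[, 1],
                 text   = m[, 3],
                 author = m[, 2],
                 stringsAsFactors = FALSE)

tm_corpus <- VCorpus(DataframeSource(df))

# Document-level metadata as a data frame, one row per document
meta(tm_corpus)
```

If this works with your tm version, meta(tm_corpus)$author should return the authors in corpus order, without any explicit loop or counter.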