How to concatenate, element-wise, two document corpora in R using tm


Start with two corpora of documents, each with the same number of documents:

library(tm)
c1 <- Corpus(VectorSource(c("document 1 corpus 1 text", "document 2 corpus 1 text")))
c2 <- Corpus(VectorSource(c("document 1 corpus 2 text", "document 2 corpus 2 text")))

I want a single corpus of the same number of documents with the terms combined element-wise to form a single document, the equivalent of:

c3 <- Corpus(VectorSource(c("document 1 corpus 1 text document 1 corpus 2 text",
                            "document 2 corpus 1 text document 2 corpus 2 text")))

Searching turned up tm_combine, but that appends the documents of one corpus to the other, producing a single corpus whose number of documents is the sum of the individual counts rather than merging the documents pairwise.


Solution

  • You can iterate over both corpora in parallel with mapply, paste the corresponding documents together, and then convert the result back into a corpus:

    Corpus(VectorSource(
      mapply(function(x, y) paste(content(x), content(y)), c1, c2)
    ))
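
The same idea can be wrapped in a small function that also checks the two corpora have equal length. This is only a sketch: combine_elementwise is a hypothetical helper name, not part of tm, and it assumes VCorpus input built from plain text. Because the corpus is rebuilt from the pasted text, any document-level metadata in the originals is not carried over.

```r
library(tm)

# Hypothetical helper (not part of tm): merge two corpora document by document.
combine_elementwise <- function(a, b, sep = " ") {
  # Refuse mismatched corpora instead of letting paste() recycle silently.
  stopifnot(length(a) == length(b))

  # Collapse each document to a single string; content() may return
  # several lines for one document.
  as_text <- function(corp) {
    vapply(corp, function(d) paste(content(d), collapse = sep), character(1))
  }

  # Paste the i-th document of a to the i-th document of b and rebuild.
  VCorpus(VectorSource(paste(as_text(a), as_text(b), sep = sep)))
}

c1 <- VCorpus(VectorSource(c("document 1 corpus 1 text", "document 2 corpus 1 text")))
c2 <- VCorpus(VectorSource(c("document 1 corpus 2 text", "document 2 corpus 2 text")))

c3 <- combine_elementwise(c1, c2)
length(c3)        # same number of documents as c1 and c2
content(c3[[1]])  # text of the first merged document
```

Checking length(c3) and content(c3[[1]]) afterwards is a quick way to confirm that the documents were merged pairwise rather than appended.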