
Combine corpora in tm 0.7.3


Using the text mining package tm for R, the following works in version 0.6.2, R version 3.4.3:

library(tm)
a = "This is the first document."
b = "This is the second document."
c = "This is the third document."
d = "This is the fourth document."
docs1 = VectorSource(c(a,b))
docs2 = VectorSource(c(c,d))
corpus1 = Corpus(docs1)
corpus2 = Corpus(docs2)
corpus3 = c(corpus1,corpus2)
inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 4

However, the same code in tm version 0.7.3 (R version 3.4.2) gives an error:

Error in UseMethod("inspect", x) :
  no applicable method for 'inspect' applied to an object of class "list"

According to vignette("tm",package="tm"), the c() function is overloaded:

Many standard operators and functions ([, [<-, [[, [[<-, c(), lapply()) are available for corpora with semantics similar to standard R routines. E.g., c() concatenates two (or more) corpora. Applied to several text documents it returns a corpus. The metadata is automatically updated, if corpora are concatenated (i.e., merged).

However, for the new version this is apparently no longer the case. How can two corpora be combined in tm 0.7.3? An obvious solution is to combine the documents first and create the corpus afterwards, but I'm looking for a solution to combine two already existing corpora.


Solution

  • I do not have much experience with the tm package, so my answer may miss some of the nuances between SimpleCorpus, VCorpus, and the other tm object classes.

    The inputs to your call to c() are of class SimpleCorpus, and tm does not appear to ship a c() method for that class. Method dispatch therefore falls back to the default c(), which returns a plain list instead of a corpus; that is why inspect() fails with 'applied to an object of class "list"'. There is, however, a c() method for the VCorpus class (tm:::c.VCorpus).
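
    To check this dispatch behaviour on your own installation, something like the following may help. It is only a diagnostic sketch: the variable name combined is mine, and the results depend on the installed tm version, so no expected output is shown.

    ```r
    library(tm)

    corpus1 <- Corpus(VectorSource(c("This is the first document.",
                                     "This is the second document.")))
    corpus2 <- Corpus(VectorSource(c("This is the third document.",
                                     "This is the fourth document.")))

    # Class vector of an input corpus; the first entry is what c() dispatches on.
    class(corpus1)

    # Class of the naive combination; a plain "list" means that no
    # corpus-specific c() method was found.
    combined <- c(corpus1, corpus2)
    class(combined)

    # TRUE means no c() method is registered for that class in this tm version.
    is.null(getS3method("c", "SimpleCorpus", optional = TRUE))
    is.null(getS3method("c", "VCorpus", optional = TRUE))
    ```

    If the first is.null() call returns TRUE and the second returns FALSE, the situation matches the one described here.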

    There are two ways to keep corpus3 from being coerced to a list, but they seem to produce different structures. I present both below and leave it to you to decide whether either achieves your end goal.

    1) You can call tm:::c.VCorpus directly when defining corpus3:

    > library(tm)
    > 
    > a = "This is the first document."
    > b = "This is the second document."
    > c = "This is the third document."
    > d = "This is the fourth document."
    > docs1 = VectorSource(c(a,b))
    > docs2 = VectorSource(c(c,d))
    > corpus1 = Corpus(docs1)
    > corpus2 = Corpus(docs2)
    > 
    > corpus3 = tm:::c.VCorpus(corpus1,corpus2)
    > 
    > inspect(corpus3)
    <<VCorpus>>
    Metadata:  corpus specific: 2, document level (indexed): 0
    Content:  documents: 4
    
    [1] This is the first document.  This is the second document. This is the third document. 
    [4] This is the fourth document.
    

    2) You can use VCorpus when defining corpus1 & corpus2:

    > library(tm)
    > 
    > a = "This is the first document."
    > b = "This is the second document."
    > c = "This is the third document."
    > d = "This is the fourth document."
    > docs1 = VectorSource(c(a,b))
    > docs2 = VectorSource(c(c,d))
    > corpus1 = VCorpus(docs1)
    > corpus2 = VCorpus(docs2)
    > 
    > corpus3 = c(corpus1,corpus2)
    > 
    > inspect(corpus3)
    <<VCorpus>>
    Metadata:  corpus specific: 0, document level (indexed): 0
    Content:  documents: 4
    
    [[1]]
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 27
    
    [[2]]
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 28
    
    [[3]]
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 27
    
    [[4]]
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 28
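
    If the corpora already exist as SimpleCorpus objects and neither option above fits, a third possibility may be to rebuild a VCorpus from their text. This is a minimal sketch for tm 0.7.x, assuming that content() returns the documents of a SimpleCorpus as a character vector; I have not checked it against other versions, and any metadata attached to the original corpora is not carried over.

    ```r
    library(tm)

    corpus1 <- Corpus(VectorSource(c("This is the first document.",
                                     "This is the second document.")))
    corpus2 <- Corpus(VectorSource(c("This is the third document.",
                                     "This is the fourth document.")))

    # Pull the raw text out of each existing corpus (assumption: content()
    # yields a character vector for a SimpleCorpus).
    texts <- c(as.character(content(corpus1)),
               as.character(content(corpus2)))

    # Build a single VCorpus from the combined text.
    corpus3 <- VCorpus(VectorSource(texts))

    # Number of documents in the combined corpus.
    length(corpus3)
    ```

    This does start from the existing corpora, but it re-creates the documents rather than merging the corpus objects themselves, so it is closer to a workaround than to a true c() replacement.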