
Concatenate dfm matrices in 'quanteda' package


Is there a method to concatenate two dfm matrices that have different numbers of rows and columns? It can be done with some additional coding, so I am not looking for ad hoc code but for a general and elegant solution, if one exists.

An example:

dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)

gives an error.

The 'tm' package can concatenate its document-term matrices out of the box, but it is too slow for my purposes.

Also recall that 'dfm' from 'quanteda' is an S4 class.


Solution

  • This should work "out of the box" if you are using the latest version:

    packageVersion("quanteda")
    ## [1] ‘0.9.6.9’
    
    dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
    dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
    
    rbind(dfm1, dfm2)
    ## Document-feature matrix of: 2 documents, 6 features.
    ## 2 x 6 sparse Matrix of class "dfmSparse"
    ##      is one sample surprise text this
    ## doc1  1   1      2        0    1    1
    ## doc2  1   1      2        1    1    1
    

    See also ?selectFeatures for the case where features is a dfm object (the help file has examples).

    Added:

    Note that this correctly aligns the two texts on a common feature set, unlike the normal rbind methods for matrices, whose columns must match. For the same reason, rbind() does not actually work in the tm package for DocumentTermMatrix objects with different terms:

    require(tm)
    dtm1 <- DocumentTermMatrix(Corpus(VectorSource(c(doc1 = "This is one sample text sample."))))
    dtm2 <- DocumentTermMatrix(Corpus(VectorSource(c(doc2 = "Surprise! This is one sample text sample."))))
    rbind(dtm1, dtm2)
    ## Error in f(init, x[[i]]) : Numbers of columns of matrices must match.
    

    Combining the two with c() almost gets it, but seems to duplicate the repeated feature (sample and sample. show up as separate terms):

    as.matrix(rbind(c(dtm1, dtm2)))
    ##     Terms
    ## Docs one sample sample. text this surprise!
    ##    1   1      1       1    1    1         0
    ##    1   1      1       1    1    1         1
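
To make the "common feature set" idea concrete, here is a minimal base R sketch of the alignment step on ordinary dense matrices. It is only an illustration under stated assumptions, not how quanteda implements rbind() for dfm objects: rbind_aligned() and its inner pad() helper are hypothetical names, and the counts are typed in by hand to mirror the example above.

```r
# Hypothetical helper (not part of quanteda or tm): row-bind two count
# matrices after padding each one to the union of their column names.
rbind_aligned <- function(a, b) {
  # Common feature set: every column name seen in either matrix.
  features <- sort(union(colnames(a), colnames(b)))

  # Expand a matrix to the common feature set, filling absent features with 0.
  pad <- function(m) {
    out <- matrix(0, nrow = nrow(m), ncol = length(features),
                  dimnames = list(rownames(m), features))
    out[, colnames(m)] <- m
    out
  }

  # Both padded matrices now share identical columns, so base rbind() works.
  rbind(pad(a), pad(b))
}

# Hand-entered counts mirroring the two example documents (an assumption,
# not output produced by dfm()).
m1 <- matrix(c(1, 1, 2, 1, 1), nrow = 1,
             dimnames = list("doc1", c("is", "one", "sample", "text", "this")))
m2 <- matrix(c(1, 1, 2, 1, 1, 1), nrow = 1,
             dimnames = list("doc2", c("is", "one", "sample", "surprise", "text", "this")))

combined <- rbind_aligned(m1, m2)
print(combined)
```

The result has one row per document and one column per feature in the union, with doc1 receiving a 0 in the surprise column. A dense matrix is used only to keep the sketch short; for real corpora a sparse representation is preferable.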