
How to add or subtract document-term matrices in quanteda?


Consider this simple example

library(quanteda)
library(dplyr)  # provides tibble() and %>%

dfm1 <- tibble(text = c('hello world',
                        'hello quanteda')) %>% 
  corpus() %>% tokens() %>% dfm()
> dfm1
Document-feature matrix of: 2 documents, 3 features (33.3% sparse).
2 x 3 sparse Matrix of class "dfm"
       features
docs    hello world quanteda
  text1     1     1        0
  text2     1     0        1

and

dfm2 <- tibble(text = c('hello world',
                        'good night quanteda')) %>% 
  corpus() %>% tokens() %>% dfm()
> dfm2
Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
2 x 5 sparse Matrix of class "dfm"
       features
docs    hello world good night quanteda
  text1     1     1    0     0        0
  text2     0     0    1     1        1

As you can see, we have the same text identifiers in the two dfms: text1 and text2.

I would like to "subtract" dfm2 from dfm1, so that each entry in dfm2 is subtracted from its (possibly) matching entry in dfm1 (same text, same word).

For instance, in text1, hello occurs once in dfm1 and once in dfm2, so the output should have 0 for that entry (that is, 1 - 1). Entries that do not appear in both dfms should be kept unchanged.

How can I do that in quanteda?


Solution

  • You can match the feature set of a dfm to that of another dfm using dfm_match(). I've also simplified your code, since part of the pipeline is unnecessary for this short example.

    library("quanteda")
    ## Package version: 1.4.3
    ## Parallel computing: 2 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    dfm1 <- dfm(c("hello world", "hello quanteda"))
    dfm2 <- dfm(c("hello world", "good night quanteda"))
    
    as.dfm(dfm1 - dfm_match(dfm2, features = featnames(dfm1)))
    ## Document-feature matrix of: 2 documents, 3 features (33.3% sparse).
    ## 2 x 3 sparse Matrix of class "dfm"
    ##        features
    ## docs    hello world quanteda
    ##   text1     0     0        0
    ##   text2     1     0        0
    

    The as.dfm() call is needed because the - operator (like +) is defined for the parent sparse Matrix class and not specifically for a quanteda dfm, so the subtraction drops the dfm class and returns a dgCMatrix. Coercing the result back into a dfm with as.dfm() solves that, but the original attributes of the dfm objects, such as docvars, are lost.
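
The answer above matches dfm2 to the features of dfm1, so features that occur only in dfm2 (good, night) are dropped from the result. If those should be kept as well, one possible variation is to match both dfms to the union of their feature names before subtracting. This is a minimal sketch, not part of the original answer. It assumes that a feature missing from one dfm counts as 0 there, so features found only in dfm2 come out negative. The names all_feats, diff_mat and diff_dense are illustrative, and the dfms are built through tokens() because newer quanteda releases no longer accept character input in dfm().

```r
library("quanteda")

# Build the two dfms; tokens() is called explicitly for newer quanteda versions
dfm1 <- dfm(tokens(c("hello world", "hello quanteda")))
dfm2 <- dfm(tokens(c("hello world", "good night quanteda")))

# Union of both vocabularies
# (assumption: a feature absent from one dfm is treated as a count of 0)
all_feats <- union(featnames(dfm1), featnames(dfm2))

# Pad both dfms to the shared feature set, in the same column order, then subtract
diff_mat <- dfm_match(dfm1, features = all_feats) -
  dfm_match(dfm2, features = all_feats)

# Convert to a base R matrix for inspection; this works whether the
# subtraction returned a dfm or a plain sparse Matrix
diff_dense <- as.matrix(diff_mat)
diff_dense
```

Under that zero-fill assumption, text1 should be all zeros and text2 should show 1 for hello and -1 for good and night. Wrapping diff_mat in as.dfm(), as in the answer, may still be useful if later steps expect a dfm, with the same caveat about lost docvars.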