Consider this simple example:
dfm1 <- tibble(text = c('hello world',
'hello quanteda')) %>%
corpus() %>% tokens() %>% dfm()
> dfm1
Document-feature matrix of: 2 documents, 3 features (33.3% sparse).
2 x 3 sparse Matrix of class "dfm"
features
docs hello world quanteda
text1 1 1 0
text2 1 0 1
and
dfm2 <- tibble(text = c('hello world',
'good nigth quanteda')) %>%
corpus() %>% tokens() %>% dfm()
Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
2 x 5 sparse Matrix of class "dfm"
features
docs hello world good nigth quanteda
text1 1 1 0 0 0
text2 0 0 1 1 1
As you can see, we have the same text identifiers in the two dfms: text1 and text2.
I would like to "subtract" dfm2 from dfm1, so that each entry in dfm1 has its (possibly) matching entry in dfm2 (same text, same word) subtracted from it.
So for instance, in text1, hello occurs 1 time in dfm1 and it also occurs 1 time in dfm2. So the output should have 0 for that entry (that is: 1-1). Of course, entries that are not in both dfms should be kept unchanged.
How can I do that in quanteda?
You can match the feature set of a dfm to that of another dfm using dfm_match(). I've also tidied up your code, since for this short example some of your pipeline could be simplified.
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dfm1 <- dfm(c("hello world", "hello quanteda"))
dfm2 <- dfm(c("hello world", "good night quanteda"))
as.dfm(dfm1 - dfm_match(dfm2, features = featnames(dfm1)))
## Document-feature matrix of: 2 documents, 3 features (33.3% sparse).
## 2 x 3 sparse Matrix of class "dfm"
## features
## docs hello world quanteda
## text1 0 0 0
## text2 1 0 0
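To see what dfm_match() contributes on its own, here is a minimal sketch that inspects the matched dfm before any subtraction. It builds the dfms through tokens() rather than passing character vectors straight to dfm(), on the assumption that you may be running a newer quanteda release than the 1.4.3 shown above; dfm2_matched is just an illustrative name.

```r
library("quanteda")

# Build the two dfms via tokens(); unnamed texts get the names text1, text2
dfm1 <- dfm(tokens(c("hello world", "hello quanteda")))
dfm2 <- dfm(tokens(c("hello world", "good night quanteda")))

# Conform dfm2 to dfm1's feature set: features only in dfm2 ("good", "night")
# are dropped, and the columns follow the order of featnames(dfm1)
dfm2_matched <- dfm_match(dfm2, features = featnames(dfm1))

featnames(dfm2_matched)
dim(dfm2_matched)
```

After matching, both objects have the same documents and the same columns in the same order, which is what makes the element-wise subtraction meaningful.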
The as.dfm() is needed because the - operator is defined for the parent sparse Matrix class and not specifically for a quanteda dfm, so it drops the dfm's class and turns the result into a dgCMatrix. Coercing it back into a dfm using as.dfm() solves that, but it will drop the original attributes of the dfm objects, such as docvars.
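If you need the docvars afterwards, one possible workaround is to copy them back from the original dfm once the result has been coerced. This is only a sketch: the group docvar is made up for illustration, and it assumes that the docvars<- replacement function accepts a dfm in your quanteda version, so please check that before relying on it.

```r
library("quanteda")

# Hypothetical docvar "group", attached only to illustrate the round trip
corp <- corpus(c("hello world", "hello quanteda"),
               docvars = data.frame(group = c("a", "b")))
dfm1 <- dfm(tokens(corp))
dfm2 <- dfm(tokens(c("hello world", "good night quanteda")))

# Subtract, then coerce the resulting sparse Matrix back into a dfm
result <- as.dfm(dfm1 - dfm_match(dfm2, features = featnames(dfm1)))

# Reattach the document variables from the original dfm
# (assumes docvars<- is available for dfm objects in your version)
docvars(result) <- docvars(dfm1)
docvars(result)
```

This works here because the subtraction leaves the documents and their order untouched, so the rows of docvars(dfm1) still line up with the rows of the result.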