Consider this simple example:
tibble(text = c('a grande latte with soy milk',
'black coffee no room',
'latte is a latte',
'coke, diet coke'),
myday = c(ymd('2018-01-01','2018-01-01','2018-01-03','2018-01-03'))) %>%
corpus() %>%
tokens() %>%
dfm()
Document-feature matrix of: 4 documents, 14 features (71.4% sparse).
4 x 14 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room is coke , diet
  text1 1      1     1    1   1    1     0      0  0    0  0    0 0    0
  text2 0      0     0    0   0    0     1      1  1    1  0    0 0    0
  text3 1      0     2    0   0    0     0      0  0    0  1    0 0    0
  text4 0      0     0    0   0    0     0      0  0    0  0    2 1    1
I am interested in getting the proportion of the word coffee, aggregated by day.
That is, for day 2018-01-01 we can see that there are 10 words (a, grande, latte, with, soy, milk, black, coffee, no, room) and coffee is mentioned only once. So the proportion is 1/10. The same reasoning applies to the other days.
How can I do that in quanteda? Of course, the idea is to avoid materializing the sparse matrix into a dense matrix.
Thanks!
This is easy, and it follows from a core quanteda design decision: your docvars are passed through from the corpus object to "downstream" objects such as a dfm. You can solve this by using dfm_group() on the myday docvar and then weighting.
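To spell out what grouping followed by proportional weighting computes, here is the arithmetic written out. The notation is mine rather than quanteda's: c_{i,w} is the count of feature w in document i, and D_d is the set of documents dated d. Note that the comma in "coke, diet coke" counts as a token, since tokens() keeps punctuation by default.

```latex
\[
p_{d,w} \;=\; \frac{\sum_{i \in D_d} c_{i,w}}{\sum_{i \in D_d} \sum_{w'} c_{i,w'}},
\qquad
p_{\text{2018-01-01},\,\text{coffee}} = \frac{0 + 1}{6 + 4} = 0.1,
\qquad
p_{\text{2018-01-03},\,\text{coffee}} = \frac{0 + 0}{4 + 4} = 0 .
\]
```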
First, to make your example fully reproducible, and to assign your dfm object a name:
library("quanteda")
## Package version: 1.4.3
library("tibble")
library("lubridate")
dfmat <- tibble(
text = c(
"a grande latte with soy milk",
"black coffee no room",
"latte is a latte",
"coke, diet coke"
),
myday = c(ymd("2018-01-01", "2018-01-01", "2018-01-03", "2018-01-03"))
) %>%
corpus() %>%
tokens() %>%
dfm()
Now it's just two operations to get your desired result.
dfmat2 <- dfm_group(dfmat, groups = "myday") %>%
dfm_weight(scheme = "prop")
dfmat2
## Document-feature matrix of: 2 documents, 14 features (42.9% sparse).
## 2 x 14 sparse Matrix of class "dfm"
##             features
## docs             a grande latte with soy milk black coffee  no room    is
##   2018-01-01 0.100    0.1  0.10  0.1 0.1  0.1   0.1    0.1 0.1  0.1     0
##   2018-01-03 0.125      0  0.25    0   0    0     0      0   0    0 0.125
##             features
## docs         coke     ,  diet
##   2018-01-01    0     0     0
##   2018-01-03 0.25 0.125 0.125
dfmat2[, "coffee"]
## Document-feature matrix of: 2 documents, 1 feature (50.0% sparse).
## 2 x 1 sparse Matrix of class "dfm"
##             features
## docs         coffee
##   2018-01-01    0.1
##   2018-01-03      0
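If you want the result as a plain named vector, a minimal sketch follows. It rebuilds the same toy data with base R only, and it passes the grouping values themselves via docvars() instead of the docvar name as a string. The object names dat, dfmat_day and coffee_by_day are just illustrative. Passing the values is my assumption about what is most portable: as far as I know, later quanteda releases changed how the groups argument is interpreted, so check ?dfm_group for your installed version. Only the single coffee column is converted to a dense matrix here; the full dfm stays sparse.

```r
library("quanteda")

# Same toy data as above, built with base R only (illustrative object names)
dat <- data.frame(
  text = c(
    "a grande latte with soy milk",
    "black coffee no room",
    "latte is a latte",
    "coke, diet coke"
  ),
  myday = as.Date(c("2018-01-01", "2018-01-01", "2018-01-03", "2018-01-03")),
  stringsAsFactors = FALSE
)
dfmat <- dfm(tokens(corpus(dat)))

# Group by the values of the myday docvar, then convert counts to proportions
dfmat_day <- dfm_weight(
  dfm_group(dfmat, groups = docvars(dfmat, "myday")),
  scheme = "prop"
)

# Densify only the one column of interest, leaving the rest of the dfm sparse
coffee_by_day <- as.matrix(dfmat_day[, "coffee"])[, 1]
coffee_by_day
```

From here, coffee_by_day can be joined back to other per-day data by its names.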