I would like to group the documents in my dfm by two variables, speaker and week_start. I was previously able to do this using:
dfm(corpus, groups = c("speaker", "week_start"))
This worked fine and grouped documents by speaker-week.
However, since the recent updates to the quanteda package this no longer works, so I now create the dfm first and then try to group it. Below is the code:
dfm <- dfm(corpus)
dfm <- dfm_group(dfm, groups = c(speaker, week_start))
However, when I do this I get the error:
Error: groups must have length ndoc(x)
I have also tried putting the docvar names in quotation marks, but this generates the same error.
We changed the usage of the groups argument in v3 to make it more standard.
From news(Version >= "3.0", package = "quanteda"):
We have added non-standard evaluation for by and groups arguments to access object docvars:
- The *_sample() functions' argument by, and groups in the *_group() functions, now take unquoted document variable (docvar) names directly, similar to the way the subset argument works in the *_subset() functions.
- Quoted docvar names no longer work, as these will be evaluated literally.
- The by = "document" formerly sampled from docid(x), but this functionality is now removed. Instead, use by = docid(x) to replicate this functionality.
- For groups, the default is now docid(x), which is now documented more completely. See ?groups and ?docid.
So, to get the previous behaviour, you would want to use:
groups = interaction(speaker, week_start)
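To see why c() triggers the length error while interaction() does not, here is a minimal base-R sketch that does not need quanteda. The two vectors are hypothetical stand-ins for the docvars of a four-document corpus, not the asker's data.

```r
# Hypothetical stand-ins for the docvars of a four-document corpus
speaker <- c("a", "a", "b", "b")
week_start <- c(1, 2, 1, 1)

# c() concatenates the two vectors, so the result has two values
# per document instead of one
n_concat <- length(c(speaker, week_start))

# interaction() crosses the two variables into a single factor
# with exactly one value per document
g <- interaction(speaker, week_start)
n_crossed <- length(g)

print(n_concat)
print(n_crossed)
print(as.character(g))
```

The concatenated vector is twice as long as the number of documents, which is what the length check on groups rejects, while the crossed factor has one label per document.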
Here's an example:
library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
corp <- corpus(
  c(
    "a b c",
    "a c d",
    "c d d",
    "d d e"
  ),
  docvars = data.frame(
    var1 = c("a", "a", "b", "b"),
    var2 = c(1, 2, 1, 1)
  )
)
corp %>%
  tokens() %>%
  dfm() %>%
  dfm_group(groups = interaction(var1, var2))
## Document-feature matrix of: 3 documents, 5 features (40.00% sparse) and 2 docvars.
##      features
## docs  a b c d e
##   a.1 1 1 1 0 0
##   b.1 0 0 1 4 1
##   a.2 1 0 1 1 0