
Quanteda group documents by multiple variables


I would like to group the documents in my dfm by two variables, speaker and week_start. Previously I could do this with dfm(corpus, groups = c("speaker", "week_start")), which worked fine and grouped the documents by speaker-week.

However, since the recent updates to the quanteda package I have been running into problems. I now create the dfm first and then try to group it, using the code below:

dfm <- dfm(corpus)
dfm <- dfm_group(dfm, groups = c(speaker, week_start))

However, when I do this I get the error:

Error: groups must have length ndoc(x)

I have also tried putting the docvar names in quotes, but this generates the same error.


Solution

  • We changed the usage of the groups argument in v3 to make it more standard.

    From news(Version >= "3.0", package = "quanteda"):

    We have added non-standard evaluation for by and groups arguments to access object docvars:

    • The *_sample() functions' argument by, and groups in the *_group() functions, now take unquoted document variable (docvar) names directly, similar to the way the subset argument works in the *_subset() functions.
    • Quoted docvar names no longer work, as these will be evaluated literally.
    • The by = "document" formerly sampled from docid(x), but this functionality is now removed. Instead, use by = docid(x) to replicate this functionality.
    • For groups, the default is now docid(x), which is now documented more completely. See ?groups and ?docid.

    So, to get the previous behaviour, you would want to use:

    groups = interaction(speaker, week_start)
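
To see why c(speaker, week_start) triggers the length error while interaction() does not, here is a minimal base R sketch. The data frame dv and its values are made up for illustration and only stand in for the docvars of a three-document corpus; with() is used to mimic evaluating the expression inside the docvars.

```r
# Hypothetical docvars for three documents (column names taken from the question).
dv <- data.frame(
  speaker    = c("Smith", "Smith", "Jones"),
  week_start = c("2021-01-04", "2021-01-11", "2021-01-04")
)

# c() concatenates the two columns end to end: 6 values for 3 documents.
concatenated <- with(dv, c(speaker, week_start))
length(concatenated)  # 6

# interaction() combines them row by row: one value per document.
combined <- with(dv, interaction(speaker, week_start))
length(combined)      # 3
as.character(combined)
```

A groups vector needs one entry per document, so the concatenated version (twice ndoc) is rejected, while the combined factor has the right length.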
    

    Here's an example:

    library("quanteda")
    ## Package version: 3.0
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    # four short documents with two document variables to group on
    corp <- corpus(
      c("a b c", "a c d", "c d d", "d d e"),
      docvars = data.frame(
        var1 = c("a", "a", "b", "b"),
        var2 = c(1, 2, 1, 1)
      )
    )
    corp %>%
      tokens() %>%
      dfm() %>%
      dfm_group(groups = interaction(var1, var2))
    ## Document-feature matrix of: 3 documents, 5 features (40.00% sparse) and 2 docvars.
    ##      features
    ## docs  a b c d e
    ##   a.1 1 1 1 0 0
    ##   b.1 0 0 1 4 1
    ##   a.2 1 0 1 1 0
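
The row order in that output (a.1, b.1, a.2) comes from how interaction() orders its levels, with the first variable varying fastest. The base R sketch below reuses the var1 and var2 values from the example to show the level order and the one combination (b.2) that never occurs. The names g, all_levels and used_levels are just illustrative.

```r
# Same values as the docvars in the example above.
var1 <- c("a", "a", "b", "b")
var2 <- c(1, 2, 1, 1)

g <- interaction(var1, var2)

# All possible combinations, first variable varying fastest.
all_levels <- levels(g)               # "a.1" "b.1" "a.2" "b.2"

# Only the combinations that actually occur in the data.
used_levels <- levels(droplevels(g))  # "a.1" "b.1" "a.2"

# Number of documents falling into each combination.
table(g)
```

The grouped dfm above has three rows, which matches the used levels rather than all four. If you need a row for the empty combination as well, the fill argument described in ?dfm_group looks like the place to check.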