quanteda::dfm_lookup(): capture found term


I would like to use quanteda's dfm_lookup() with a dictionary, but also retrieve the tokens that produced each match.

Consider the following example:

library("quanteda")

dict_ex <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                        opposition = c("Opposition", "reject", "notincorpus"),
                        taxglob = "tax*",
                        taxregex = "tax.+$",
                        country = c("United_States", "Sweden")))
toks_ex <- tokens(c("My Christmas was ruined by your opposition tax plan.",
                    "Does the United_States or Sweden have more progressive taxation?"))
# dfm(remove = ) is deprecated in quanteda 3; drop stopwords with dfm_remove() instead
dfmat_ex <- dfm_remove(dfm(toks_ex), stopwords("english"))

dfmat_ex
dfm_lookup(dfmat_ex, dict_ex)

This gives me:

Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
       features
docs    christmas opposition taxglob taxregex country
  text1         1          1       1        0       0
  text2         0          0       1        0       2

However, since every dictionary key has multiple entries, I would like to know which token produced each match. (My real dictionary is rather long, so the example may seem trivial, but the real use case is not.)

I would like to achieve a result like this:

Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
       features
docs    christmas christmas.match opposition opposition.match taxglob taxglob.match taxregex taxregex.match country         country.match
  text1         1       Christmas          1       Opposition       1           tax        0             NA       0                    NA
  text2         0              NA          0               NA       1      taxation        0             NA       2 United_States, Sweden

Can someone help me with this? Many thanks in advance! :)


Solution

  • That's not really possible for two reasons.

    First, a matrix(-like) object (dfm or otherwise) cannot mix element modes, here counts and character values. This would be possible with a data.frame, but then you lose the advantages of sparsity, and you would end up with an n x 2V data.frame (where V is the number of features).

    Second, "christmas.match" could have more than one feature/token matching it, so the character value would require a list, straining the object class even further.
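
    Both points can be seen in base R alone. The following is a minimal sketch, not quanteda code: the object names m and df are made up for illustration, and the values are typed in by hand to mirror the question's example.

    ```r
    # Reason 1: a matrix has a single mode, so combining counts with
    # strings coerces the counts to character.
    m <- cbind(christmas = c(1, 0),
               christmas.match = c("Christmas", NA))
    mode(m)              # no longer numeric
    m[1, "christmas"]    # the count is now a string

    # Reason 2: several matches per cell only fit in a list column,
    # which a data.frame allows but a (sparse) matrix does not.
    df <- data.frame(country = c(0, 2))
    df$country.match <- list(character(0), c("United_States", "Sweden"))
    str(df)
    ```

    The data.frame version works, but it is dense and needs one list column per dictionary key.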

    A better way would be to use kwic() to match the tokens to the patterns formed by the dictionary. You can do this for the keys by supplying the dictionary as the pattern argument, or for each value by unlisting the dictionary first.

    library("quanteda")
    ## Package version: 3.1
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    dict <- dictionary(list(one = c("a*", "b"), two = c("e", "f")))
    
    toks <- tokens(c(d1 = "a b c d e f g and another"))
    
    # where the dictionary keys are the patterns matched
    as.data.frame(kwic(toks, dict))
    ##   docname from to         pre keyword            post pattern
    ## 1      d1    1  1                   a       b c d e f     one
    ## 2      d1    2  2           a       b       c d e f g     one
    ## 3      d1    5  5     a b c d       e f g and another     two
    ## 4      d1    6  6   a b c d e       f   g and another     two
    ## 5      d1    8  8   c d e f g     and         another     one
    ## 6      d1    9  9 d e f g and another                     one
    
    # where the dictionary values are the patterns matched
    as.data.frame(kwic(toks, unlist(dict)))
    ##   docname from to         pre keyword            post pattern
    ## 1      d1    1  1                   a       b c d e f      a*
    ## 2      d1    2  2           a       b       c d e f g       b
    ## 3      d1    5  5     a b c d       e f g and another       e
    ## 4      d1    6  6   a b c d e       f   g and another       f
    ## 5      d1    8  8   c d e f g     and         another      a*
    ## 6      d1    9  9 d e f g and another                      a*
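
    If you want something closer to the "key.match" columns from the question, one option is to collapse the kwic() result to one row per document and dictionary key. This is only a sketch using base R's aggregate(); the name matches and the comma-separated format are choices made here for illustration, not part of quanteda.

    ```r
    library("quanteda")

    dict <- dictionary(list(one = c("a*", "b"), two = c("e", "f")))
    toks <- tokens(c(d1 = "a b c d e f g and another"))

    # One row per match, with the dictionary key in the pattern column
    kw <- as.data.frame(kwic(toks, dict))

    # Collapse to one row per document and key, listing the distinct
    # tokens that triggered each key
    matches <- aggregate(keyword ~ docname + pattern, data = kw,
                         FUN = function(x) paste(unique(x), collapse = ", "))
    matches
    ```

    The result is a long-format table that can be kept alongside the dfm_lookup() counts instead of being forced into the dfm itself.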