I would like to use the amazing quanteda's dfm_lookup() with a dictionary, but also retrieve the matches.
Consider the following example:
library("quanteda")
dict_ex <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                           opposition = c("Opposition", "reject", "notincorpus"),
                           taxglob = "tax*",
                           taxregex = "tax.+$",
                           country = c("United_States", "Sweden")))
toks_ex <- tokens(c("My Christmas was ruined by your opposition tax plan.",
                    "Does the United_States or Sweden have more progressive taxation?"))
# remove stopwords at the tokens stage; dfm()'s `remove` argument is deprecated as of quanteda v3
dfmat_ex <- dfm(tokens_remove(toks_ex, stopwords("english")))
dfmat_ex
dfm_lookup(dfmat_ex, dict_ex)
This gives me:
Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
       features
docs    christmas opposition taxglob taxregex country
  text1         1          1       1        0       0
  text2         0          0       1        0       2
However, since each dictionary key has multiple entries, I would like to know which token produced the match. (My real dictionary is rather long, so while the example might seem trivial, the real use case is not.)
I would like to achieve a result like this:
Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
       features
docs    christmas christmas.match opposition opposition.match taxglob taxglob.match taxregex taxregex.match country country.match
  text1         1       Christmas          1       Opposition       1           tax        0             NA       0 NA
  text2         0              NA          0               NA       1      taxation        0             NA       2 United_States, Sweden
Can someone help me with this? Many thanks in advance! :)
That's not really possible for two reasons.
First, a matrix(-like) object (dfm or otherwise) cannot mix element modes, here counts and character values. This would be possible with a data.frame, but then you lose the advantages of sparsity, and you would end up with an n x 2V data.frame (where V is the number of features).
Second, "christmas.match" could have more than one feature/token matching it, so the character value would require a list, straining the object class even further.
A better way would be to use kwic() to match the tokens to the patterns formed by the dictionary. You can do this for the keys by supplying the dictionary as the pattern argument, or for each value by unlisting the dictionary first.
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dict <- dictionary(list(one = c("a*", "b"), two = c("e", "f")))
toks <- tokens(c(d1 = "a b c d e f g and another"))
# where the dictionary keys are the patterns matched
as.data.frame(kwic(toks, pattern = dict))
##   docname from to         pre keyword            post pattern
## 1      d1    1  1                   a       b c d e f     one
## 2      d1    2  2           a       b       c d e f g     one
## 3      d1    5  5     a b c d       e f g and another     two
## 4      d1    6  6   a b c d e       f   g and another     two
## 5      d1    8  8   c d e f g     and         another     one
## 6      d1    9  9 d e f g and another                     one
# where the dictionary values are the patterns matched
as.data.frame(kwic(toks, pattern = unlist(dict)))
##   docname from to         pre keyword            post pattern
## 1      d1    1  1                   a       b c d e f      a*
## 2      d1    2  2           a       b       c d e f g       b
## 3      d1    5  5     a b c d       e f g and another       e
## 4      d1    6  6   a b c d e       f   g and another       f
## 5      d1    8  8   c d e f g     and         another      a*
## 6      d1    9  9 d e f g and another                      a*
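If you want something closer to the "one cell of matches per document and key" layout from the question, one option is to collapse the kwic() output yourself. The following is only a sketch using base R's aggregate() on the data.frame shown above; the object names kw and matches are illustrative and not part of quanteda.

```r
library("quanteda")

dict <- dictionary(list(one = c("a*", "b"), two = c("e", "f")))
toks <- tokens(c(d1 = "a b c d e f g and another"))

# one row per match, with the dictionary key in the `pattern` column
kw <- as.data.frame(kwic(toks, pattern = dict))

# collapse the matched tokens into one comma-separated string
# per document and dictionary key (base R, not a quanteda function)
matches <- aggregate(
  keyword ~ docname + pattern,
  data = kw,
  FUN = function(x) paste(unique(x), collapse = ", ")
)
matches
```

Note that document/key combinations without any match simply do not appear in the result (there is no NA row for them), and the result is an ordinary data.frame rather than a dfm, so it is best kept alongside the output of dfm_lookup() rather than merged into it.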