Subset/select from a DFM using a dictionary in quanteda

I have a corpus of texts from various countries. I am trying to see how often a specific term appears in the texts for each country. To do so, I am following the example here: https://quanteda.io/articles/pkgdown/examples/plotting.html#frequency-plots

freq_grouped <- textstat_frequency(dfm(full_corpus), 
                                   groups = "Country")

freq_const <- subset(freq_grouped, freq_grouped$feature %in% "constitution")

This works fine, except that it only captures the exact term ("constitution"). I'd like to capture variations of the term (e.g. "charter of rights and freedoms"), use globs (e.g. "*constitution*"), and count all the results under the same category. I tried using a dictionary for this, but I get zero results.

dict <- dictionary(list(constitution = c('*constitution*', 'charter of rights and freedoms', 
                                         'canadian charter', 'constituição*', '*constitucion*')))

freq_const <- subset(freq_grouped, freq_grouped$feature %in% dict)

freq_const
    [1] feature   frequency rank      docfreq   group    
    <0 rows> (or 0-length row.names)

How can I go about achieving this?

Solution

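The basic answer is that you cannot subset a dfm using a dictionary or any other sort of pattern match, because dfm_subset() requires a logical vector for its subset argument, with one value per document. A dictionary matches features, not documents. (The subset() call in the question returns zero rows for a related reason: %in% does exact matching against its right-hand side and never expands globs or dictionary entries.)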

If you want to match features without selecting documents, which I think is what you intended, then you can use dfm_select(). A quanteda dictionary is a valid input for the pattern argument of that function, and valuetype = "glob" specifies that the patterns are globs rather than regular expressions.

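library("quanteda")

# `dict` is the dictionary defined in the question
subdfm <- dfm(data_corpus_inaugural) %>%
    # keep only the features that match a dictionary pattern
    dfm_select(pattern = dict, valuetype = "glob")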

head(subdfm)
## Document-feature matrix of: 6 documents, 5 features (66.7% sparse).
## 6 x 5 sparse Matrix of class "dfm"
##                  features
## docs              constitutional constitution constitutions constitutionally unconstitutional
##   1789-Washington              1            1             0                0                0
##   1793-Washington              1            1             0                0                0
##   1797-Adams                   0            8             1                0                0
##   1801-Jefferson               1            2             0                0                0
##   1805-Jefferson               0            6             0                0                0
##   1809-Madison                 0            1             0                0                0

textstat_frequency(subdfm)
##            feature frequency rank docfreq group
## 1     constitution       206    1      37   all
## 2   constitutional        53    2      24   all
## 3    constitutions         4    3       3   all
## 4 constitutionally         4    4       3   all
## 5 unconstitutional         3    5       3   all

If the corpus from which you create the dfm has docvars, they are carried over to the dfm, so you can also pass them to the textstat_frequency() call through its groups argument, as in the question.
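Note that dfm_select() keeps each matching feature as its own column, and a dfm of single words cannot match multi-word entries such as "charter of rights and freedoms". If the goal is one count per group under the dictionary key, one possible approach (a sketch, not part of the answer above) is to apply the dictionary to tokens with tokens_lookup() before building the dfm. The example below assumes quanteda version 3 or later, where textstat_frequency() lives in the separate quanteda.textstats package and groups accepts an unquoted docvar name. It uses the Party docvar of data_corpus_inaugural as a stand-in for Country.

```r
# Sketch only: assumes quanteda >= 3 with quanteda.textstats installed
library("quanteda")
library("quanteda.textstats")

# Same dictionary as in the question
dict <- dictionary(list(constitution = c("*constitution*",
                                         "charter of rights and freedoms",
                                         "canadian charter",
                                         "constituição*",
                                         "*constitucion*")))

# Tokenize first so that multi-word entries can match as phrases
toks <- tokens(data_corpus_inaugural)

# Replace every match with its dictionary key and drop all other tokens
toks_dict <- tokens_lookup(toks, dictionary = dict, valuetype = "glob")

# The resulting dfm has a single feature, "constitution"
dfmat_dict <- dfm(toks_dict)

# One row per group; Party stands in for the Country docvar
freq_const <- textstat_frequency(dfmat_dict, groups = Party)
freq_const
```

With the corpus from the question, replacing data_corpus_inaugural with full_corpus and Party with Country should give one "constitution" row per country. In older quanteda versions the grouping variable is passed as a quoted string, as in the question.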