I have a corpus of texts from various countries. I am trying to see how often a specific term appears in the texts for each country. To do so, I am following the example here: https://quanteda.io/articles/pkgdown/examples/plotting.html#frequency-plots
freq_grouped <- textstat_frequency(dfm(full_corpus),
groups = "Country")
freq_const <- subset(freq_grouped, freq_grouped$feature %in% "constitution")
This works fine, except that it only captures the exact term ("constitution"). I'd like to be able to capture variations of the term (e.g. "charter of rights and freedoms"), use globs (e.g. "*constitution*"), and count the results under the same category. I tried using a dictionary for this, but I get zero results.
dict <- dictionary(list(constitution = c('*constitution*', 'charter of rights and freedoms',
'canadian charter', 'constituição*', '*constitucion*')))
freq_const <- subset(freq_grouped, freq_grouped$feature %in% dict)
freq_const
[1] feature frequency rank docfreq group
<0 rows> (or 0-length row.names)
How can I go about achieving this?
The basic answer is that you cannot subset a dfm using a dictionary or any other sort of pattern match, because dfm_subset() requires a logical value for its subset argument that matches 1:1 with documents. A dictionary would match features, not documents.
If you wanted to match features while not selecting documents, however (which I think is what you intended), then you can use dfm_select(), and a quanteda dictionary is a valid input for the pattern argument of that command. With the valuetype = "glob" argument, furthermore, you can specify that your pattern match is a glob rather than a regex.
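library("quanteda")
# tokenize first, then build the dfm (calling dfm() directly on a corpus is deprecated in newer quanteda releases)
# keep only the features that match the dictionary's glob patterns
subdfm <- dfm_select(dfm(tokens(data_corpus_inaugural)),
                     pattern = dict, valuetype = "glob")
head(subdfm)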
## Document-feature matrix of: 6 documents, 5 features (66.7% sparse).
## 6 x 5 sparse Matrix of class "dfm"
## features
## docs constitutional constitution constitutions constitutionally unconstitutional
## 1789-Washington 1 1 0 0 0
## 1793-Washington 1 1 0 0 0
## 1797-Adams 0 8 1 0 0
## 1801-Jefferson 1 2 0 0 0
## 1805-Jefferson 0 6 0 0 0
## 1809-Madison 0 1 0 0 0
textstat_frequency(subdfm)
## feature frequency rank docfreq group
## 1 constitution 206 1 37 all
## 2 constitutional 53 2 24 all
## 3 constitutions 4 3 3 all
## 4 constitutionally 4 4 3 all
## 5 unconstitutional 3 5 3 all
If you have docvars for the corpus from which you create the dfm, you can also pass these to the groups argument of the textstat_frequency() call, since the docvars remain attached to the dfm.
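To count every match under one category, as the question asks, one option is a dictionary lookup instead of feature selection. The following is only a minimal sketch, assuming a recent quanteda release: it uses tokens_lookup() so that multi-word values such as "charter of rights and freedoms" can be matched as phrases, and it uses the Party docvar of data_corpus_inaugural as a stand-in for the Country docvar in the question.

```r
library("quanteda")

# dictionary from the question: one key, several glob patterns and phrases
dict <- dictionary(list(constitution = c("*constitution*",
                                         "charter of rights and freedoms",
                                         "canadian charter",
                                         "constituição*",
                                         "*constitucion*")))

toks <- tokens(data_corpus_inaugural)

# replace every matching token or phrase with its dictionary key,
# dropping all tokens that match nothing
dict_toks <- tokens_lookup(toks, dictionary = dict, valuetype = "glob")
dict_dfm <- dfm(dict_toks)

# sum the counts of the single "constitution" feature within each group;
# "Party" is an assumption standing in for "Country"
grouped <- dfm_group(dict_dfm, groups = docvars(dict_dfm, "Party"))
grouped
```

The result has one row per group and a single "constitution" column holding the combined count of all matched variants. With your own corpus, replacing "Party" with "Country" should give the per-country totals.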