I am using a dictionary to search for occurrences of terms in a corpus where the terms may appear separately, though they will most often overlap:
corpus <- c("According to the Canadian Charter of Rights and Freedoms, all Canadians...")
dict <- dictionary(list(constitution = c("charter of rights", "canadian charter")))
kwic(corpus, dict)
The above will (correctly) identify the below sentence twice:
"According to the Canadian Charter of Rights and Freedoms, all Canadians..."
To establish how often these terms appear without double counting, however, I need to make sure that "canadian charter" is only counted when it is not followed by "...of rights...".
How can I accomplish this?
Edit: just noticed this is not an issue if using tokens_lookup, so this question is a moot point. Leaving it up in case it is helpful to anyone else.
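For anyone who wants to check this themselves, here is a minimal sketch of the tokens_lookup route, using the same sentence and dictionary as above. The object names `txt` and `looked` are only illustrative, and the exact treatment of overlapping matches may depend on the quanteda version installed, so inspect the result rather than taking the count for granted.

```r
library("quanteda")

# Same sentence and dictionary as in the question
txt <- "According to the Canadian Charter of Rights and Freedoms, all Canadians..."
dict <- dictionary(list(constitution = c("charter of rights", "canadian charter")))

# Replace dictionary matches with their key ("constitution")
looked <- tokens_lookup(tokens(txt), dict)
looked

# Tabulate how many times each key was matched per document
dfm(looked)
```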
When you ask for a kwic, you will get all pattern matches, even when these overlap. So the way to avoid the overlap, in the way that I think you are asking, is to manually convert the multi-word expressions (MWEs) into single tokens in a way that prevents their overlap. In your case you want to count "Canadian charter" when it is not followed by "of rights". I would then suggest you tokenize the text, and then compound the MWEs in a sequence that guarantees that they will not overlap. Below, "charter of rights" is compounded first, so its "charter" token is already used up and can no longer form part of a "Canadian charter" match.
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- "The Canadian charter of rights and the Canadian charter are different."
dict <- dictionary(list(constitution = c("charter of rights", "canadian charter")))
toks <- tokens(txt)
tokscomp <- toks %>%
  tokens_compound(phrase("charter of rights"), concatenator = " ") %>%
  tokens_compound(phrase("Canadian charter"), concatenator = " ")
tokscomp
## tokens from 1 document.
## text1 :
## [1] "The" "Canadian" "charter of rights"
## [4] "and" "the" "Canadian charter"
## [7] "are" "different" "."
This has made the phrases into single tokens, delimited here by a space, which means that kwic() (if that is what you want to use) will not double count them, since they are now unique MWE matches.
kwic(tokscomp, dict, window = 2)
##
## [text1, 3] The Canadian | charter of rights | and the
## [text1, 6] and the | Canadian charter | are different
Note that simply to count them, you could have used dfm() with your dictionary as the value of the select argument:
dfm(tokscomp, select = dict)
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##        features
## docs    charter of rights canadian charter
##   text1                 1                1
Finally, if you had wanted principally to distinguish "Canadian charter of rights" from "Canadian charter", you could have compounded the former first and then the latter (longest to shortest is best here). But that is not exactly what you asked.
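A minimal sketch of that longest-to-shortest variant, assuming the same example text as above. The name `tokscomp2` is just illustrative, and the calls are nested rather than piped so the snippet does not depend on `%>%` being available.

```r
library("quanteda")

txt <- "The Canadian charter of rights and the Canadian charter are different."
toks <- tokens(txt)

# Compound the longest phrase first, so that the shorter phrase
# cannot claim "Canadian" and "charter" from inside the longer one
tokscomp2 <- tokens_compound(toks, phrase("Canadian charter of rights"),
                             concatenator = " ")

# Then compound the shorter phrase wherever it still occurs on its own
tokscomp2 <- tokens_compound(tokscomp2, phrase("Canadian charter"),
                             concatenator = " ")
tokscomp2
```

This should leave "Canadian charter of rights" and "Canadian charter" as two distinct single tokens, which you can then count or pass to kwic() in the same way as before.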