I am using quanteda 4.1.0 and getting some unexpected behaviour when using a dictionary to adjust for synonyms and plurals. The ordering of the entries in the dictionary is affecting the frequency count of features.
In the example below, "banana" and its plural appear 3 times, while "apple" and its plural appear twice. But I only get the correct frequency counts when the dictionary lists "apple" before "banana". So it seems the alphabetical ordering of entries in the dictionary affects the behaviour of dfm_lookup()?
library(quanteda)
library(quanteda.textstats)
library(dplyr)  # provides %>% and filter() used below
dfmat <- dfm(tokens(c("I like apples, but I don't like apple pie. Bananas are OK",
                      "I like bananas, but I don't like banana fritter.")))
textstat_frequency(dfmat) %>% filter(grepl("apple|banana", feature))
# feature frequency rank docfreq group
# 7 bananas 2 3 2 all
# 8 apples 1 8 1 all
# 9 apple 1 8 1 all
# 13 banana 1 8 1 all
#With wildcards
#This works - expected behaviour
dict <- dictionary(list(apple = c("apple*"),
                        banana = c("banana*")))
dfmat <- dfm_lookup(dfmat,
                    dictionary = dict, exclusive = FALSE, capkeys = FALSE)
textstat_frequency(dfmat) %>% filter(grepl("apple|banana", feature))
# feature frequency rank docfreq group
# 3 banana 3 3 2 all
# 4 apple 2 4 1 all
#This doesn't work - unexpected behaviour
dict <- dictionary(list(banana = c("banana*"),
                        apple = c("apple*")))
dfmat <- dfm_lookup(dfmat,
                    dictionary = dict, exclusive = FALSE, capkeys = FALSE)
textstat_frequency(dfmat) %>% filter(grepl("apple|banana", feature))
# feature frequency rank docfreq group
# 3 apple 3 3 2 all
# 4 banana 2 4 1 all
#Without wildcards - get the same (puzzling) behaviour
#This works
#dict <- dictionary(list(apple = c("apple","apples"),
# banana = c("banana","bananas")))
#This doesn't work
#dict <- dictionary(list(banana = c("banana","bananas"),
# apple = c("apple","apples")))
I think it is a bug. dfmat1 and dfmat2 should be identical, but they are not. Until this is fixed, please use tokens_lookup().
library(quanteda)
#> Package version: 4.1.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(c("I like apples, but I don't like apple pie. Bananas are OK",
                 "I like bananas, but I don't like banana fritter."))
dfmat <- dfm(toks)
dict <- dictionary(list(apple = c("apple*"),
                        banana = c("banana*")))
dfmat1 <- dfm_lookup(dfmat,
                     dictionary = dict, exclusive = FALSE, capkeys = FALSE)
dfmat2 <- dfm_lookup(dfmat,
                     dictionary = rev(dict), exclusive = FALSE, capkeys = FALSE)
identical(as.matrix(dfmat1), as.matrix(dfmat2))
#> [1] FALSE
dfmat3 <- dfm(tokens_lookup(toks, dictionary = rev(dict),
                            exclusive = FALSE, capkeys = FALSE))
identical(as.matrix(dfmat1), as.matrix(dfmat3))
#> [1] TRUE
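If you want to confirm that the tokens_lookup() workaround does not depend on key order in your own data, a minimal sketch along these lines may help. The helper name lookup_via_tokens() is hypothetical (it is not part of quanteda); it only wraps the tokens_lookup() call shown above and compares the per-key totals for both dictionary orderings.

```r
library(quanteda)

toks <- tokens(c("I like apples, but I don't like apple pie. Bananas are OK",
                 "I like bananas, but I don't like banana fritter."))
dict <- dictionary(list(apple = c("apple*"),
                        banana = c("banana*")))

# Hypothetical helper (not a quanteda function): apply the dictionary at the
# tokens stage, then build the dfm from the replaced tokens.
lookup_via_tokens <- function(toks, dict) {
  dfm(tokens_lookup(toks, dictionary = dict,
                    exclusive = FALSE, capkeys = FALSE))
}

# Compare total counts per key, selected by name so column order is irrelevant.
keys <- c("apple", "banana")
counts_fwd <- colSums(lookup_via_tokens(toks, dict))[keys]
counts_rev <- colSums(lookup_via_tokens(toks, rev(dict)))[keys]

print(counts_fwd)
print(identical(counts_fwd, counts_rev))
```

If the workaround behaves as described above, both orderings should give the same totals, with "apple" counted twice and "banana" three times.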