
Unexpected behaviour with dfm_lookup - ordering of entries affects feature frequency counts


I am using quanteda 4.1.0 and getting some unexpected behaviour when using a dictionary to adjust for synonyms and plurals. The ordering of the entries in the dictionary is affecting the frequency count of features.

In the example below, "banana" and its plural appear 3 times, while "apple" and its plural appear twice. But I only get the correct frequency counts when the dictionary lists "apple" before "banana". So it seems the order of the entries in the dictionary affects the behaviour of dfm_lookup()?

library(quanteda)
library(quanteda.textstats)
library(dplyr)  # provides %>% and filter() used below

dfmat <- dfm(tokens(c("I like apples, but I don't like apple pie. Bananas are OK",
                      "I like bananas, but I don't like banana fritter.")))

textstat_frequency(dfmat) %>% filter(grepl("apple|banana", feature))
#    feature frequency rank docfreq group
# 7  bananas         2    3       2   all
# 8   apples         1    8       1   all
# 9    apple         1    8       1   all
# 13  banana         1    8       1   all

#With wildcards
#This works - expected behaviour
dict <- dictionary(list(apple = c("apple*"),
                        banana = c("banana*")))
dfmat <-  dfm_lookup(dfmat,
                    dictionary = dict, exclusive = FALSE, capkeys = FALSE)

textstat_frequency(dfmat) %>% filter(grepl("apple|banana", feature))
#   feature frequency rank docfreq group
# 3  banana         3    3       2   all
# 4   apple         2    4       1   all


#This doesn't work - unexpected behaviour
dict <- dictionary(list(banana = c("banana*"),
                        apple = c("apple*")))

dfmat <-  dfm_lookup(dfmat,
                    dictionary = dict, exclusive = FALSE, capkeys = FALSE)

textstat_frequency(dfmat) %>% filter(grepl("apple|banana", feature))
#   feature frequency rank docfreq group
# 3   apple         3    3       2   all
# 4  banana         2    4       1   all


#Without wildcards - get the same (puzzling) behaviour
#This works
#dict <- dictionary(list(apple = c("apple","apples"),
#                        banana = c("banana","bananas")))
#This doesn't work
#dict <- dictionary(list(banana = c("banana","bananas"),
#                        apple = c("apple","apples")))

Solution

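  • This looks like a bug in dfm_lookup(). dfmat1 and dfmat2 below differ only in the order of the dictionary entries, so they should be identical, but they are not. Until this is fixed, please apply the dictionary with tokens_lookup() before building the dfm.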

    library(quanteda)
    #> Package version: 4.1.0
    #> Unicode version: 15.1
    #> ICU version: 74.1
    #> Parallel computing: 16 of 16 threads used.
    #> See https://quanteda.io for tutorials and examples.
    
    toks <- tokens(c("I like apples, but I don't like apple pie. Bananas are OK",
                     "I like bananas, but I don't like banana fritter."))
    dfmat <- dfm(toks)
    
    dict <- dictionary(list(apple = c("apple*"),
                            banana = c("banana*")))
    dfmat1 <-  dfm_lookup(dfmat,
                         dictionary = dict, exclusive = FALSE, capkeys = FALSE)
    
    dfmat2 <-  dfm_lookup(dfmat,
                          dictionary = rev(dict), exclusive = FALSE, capkeys = FALSE)
    
    identical(as.matrix(dfmat1), as.matrix(dfmat2))
    #> [1] FALSE
    
    dfmat3 <-  dfm(tokens_lookup(toks, dictionary = rev(dict), 
                                 exclusive = FALSE, capkeys = FALSE))
    
    identical(as.matrix(dfmat1), as.matrix(dfmat3))
    #> [1] TRUE
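
Applied to the question's "banana first" dictionary, the workaround could look like the sketch below. It assumes the same two example texts; the names toks_fixed, dfmat_fixed and counts are only illustrative, and featfreq() is used instead of textstat_frequency() to keep the example short.

```r
library(quanteda)

toks <- tokens(c("I like apples, but I don't like apple pie. Bananas are OK",
                 "I like bananas, but I don't like banana fritter."))

# Dictionary in the order that gave the wrong counts with dfm_lookup()
dict <- dictionary(list(banana = c("banana*"),
                        apple = c("apple*")))

# Replace matching tokens with their dictionary key first, then build the dfm
toks_fixed <- tokens_lookup(toks, dictionary = dict,
                            exclusive = FALSE, capkeys = FALSE)
dfmat_fixed <- dfm(toks_fixed)

# Total frequency of the two dictionary keys across both documents
counts <- featfreq(dfmat_fixed)[c("apple", "banana")]
print(counts)
```

If the workaround behaves as in the answer above, this should report 2 for apple and 3 for banana regardless of the order of the dictionary entries.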