
Logical combinations in quanteda dictionaries


I'm using quanteda's dictionary lookup. I'm trying to formulate dictionary entries that match logical combinations of words.

For example:

Teddybear = (fluffy AND adorable AND soft)

Is this possible? So far I have only found a way to test for phrases, e.g. Teddybear = (soft fluffy adorable), but that requires an exact phrase match in the text. How can I get matches regardless of the order of the words?
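For reference, here is a minimal sketch of the phrase-based approach I mean (example texts are made up): a multi-word dictionary value is matched by tokens_lookup() only as that exact token sequence, so reordering the words breaks the match.

```r
library(quanteda)

# a phrase entry matches only the exact token sequence, in this order
dict <- dictionary(list(teddybear = "soft fluffy adorable"))

toks <- tokens(c("The soft fluffy adorable toy.",
                 "The fluffy soft adorable toy."), remove_punct = TRUE)
tokens_lookup(toks, dictionary = dict)
# only the first text matches; the second has the same words in a different order
```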


Solution

  • This is not currently possible directly in quanteda (v1.2.0). However, there is a workaround: create dictionary values that are permutations of your desired sequence. Here is one such solution.

    First, I will create some example texts. Note that the target words are separated by "," and, in some cases, by "and". Also, the third text contains just two of the three words rather than all three. (More on that in a moment.)

    txt <- c("The toy was fluffy, adorable and soft, he said.",
             "The soft, adorable, fluffy toy was on the floor.",
             "The fluffy, adorable toy was shaped like a bear.")
    

    Now, let's define a pair of functions that generate permutations and sub-sequences of a vector, using the combinat package. The first is an inner helper that generates permutations; the second is the main calling function, which can produce full-length permutations or any subset down to subsample_limit. (To use these more generally, of course, I'd add error checking, but I've skipped that for this example.)

    genperms <- function(vec) {
        # all permutations of vec, each collapsed into a space-separated pattern
        combs <- combinat::permn(vec)
        sapply(combs, paste, collapse = " ")
    }
    
    # vec: any vector
    # subsample_limit: integer from 1 to length(vec); the smallest subset
    #   size from which to return permutations (default: full length only)
    permutefn <- function(vec, subsample_limit = length(vec)) {
        ret <- character()
        for (i in length(vec):subsample_limit) {
            ret <- c(ret, 
                     unlist(lapply(combinat::combn(vec, i, simplify = FALSE), 
                                   genperms)))
        }
        ret
    }
    

    To demonstrate how these work:

    fas <- c("fluffy", "adorable", "soft")
    permutefn(fas)
    # [1] "fluffy adorable soft" "fluffy soft adorable" "soft fluffy adorable"
    # [4] "soft adorable fluffy" "adorable soft fluffy" "adorable fluffy soft"
    
    # and with subsampling:
    permutefn(fas, 2)
    #  [1] "fluffy adorable soft" "fluffy soft adorable" "soft fluffy adorable"
    #  [4] "soft adorable fluffy" "adorable soft fluffy" "adorable fluffy soft"
    #  [7] "fluffy adorable"      "adorable fluffy"      "fluffy soft"         
    # [10] "soft fluffy"          "adorable soft"        "soft adorable" 
    

    Now apply these to the texts using tokens_lookup(). I've avoided the punctuation issue by setting remove_punct = TRUE. To show the original tokens that were not replaced, I have also used exclusive = FALSE.

    tokens(txt, remove_punct = TRUE) %>%
        tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas))),
                      exclusive = FALSE)
    # tokens from 3 documents.
    # text1 :
    # [1] "The"      "toy"      "was"      "fluffy"   "adorable" "and"      "soft"    
    # [8] "he"       "said"    
    # 
    # text2 :
    # [1] "The"      "TEDDYBEAR" "toy"       "was"       "on"        "the"      
    # [8] "floor"    
    # 
    # text3 :
    # [1] "The"      "fluffy"   "adorable" "toy"      "was"      "shaped"   "like"    
    # [8] "a"        "bear"   
    

    The first case here was not caught, because the second and third words were separated by "and". We can drop that token using tokens_remove(), and then we get the match:

    tokens(txt, remove_punct = TRUE) %>%
        tokens_remove("and") %>%
        tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas))),
                      exclusive = FALSE)
    # tokens from 3 documents.
    # text1 :
    # [1] "The"       "toy"       "was"       "TEDDYBEAR" "he"        "said"     
    # 
    # text2 :
    # [1] "The"       "TEDDYBEAR" "toy"       "was"       "on"        "the"       "floor"    
    # 
    # text3 :
    # [1] "The"      "fluffy"   "adorable" "toy"      "was"      "shaped"   "like"    
    # [8] "a"        "bear"  
    

    Finally, to match the third text in which just two of the three dictionary elements exist, we can pass 2 as the subsample_limit argument:

    tokens(txt, remove_punct = TRUE) %>%
        tokens_remove("and") %>%
        tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas, 2))), 
                      exclusive = FALSE)
    # tokens from 3 documents.
    # text1 :
    # [1] "The"       "toy"       "was"       "TEDDYBEAR" "he"        "said"     
    # 
    # text2 :
    # [1] "The"       "TEDDYBEAR" "toy"       "was"       "on"        "the"       "floor"    
    # 
    # text3 :
    # [1] "The"       "TEDDYBEAR" "toy"       "was"       "shaped"    "like"      "a"        
    # [8] "bear" 
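If you want document-level counts rather than the modified token stream, you can keep the default exclusive = TRUE and tabulate the matches with dfm(). A self-contained sketch, condensing the functions from above into one block:

```r
library(quanteda)

txt <- c("The toy was fluffy, adorable and soft, he said.",
         "The soft, adorable, fluffy toy was on the floor.",
         "The fluffy, adorable toy was shaped like a bear.")

# permutations of all subsets of size >= subsample_limit, as above
genperms <- function(vec) sapply(combinat::permn(vec), paste, collapse = " ")
permutefn <- function(vec, subsample_limit = length(vec)) {
    unlist(lapply(length(vec):subsample_limit, function(i)
        unlist(lapply(combinat::combn(vec, i, simplify = FALSE), genperms))))
}

tokens(txt, remove_punct = TRUE) %>%
    tokens_remove("and") %>%
    tokens_lookup(dictionary = dictionary(
        list(teddybear = permutefn(c("fluffy", "adorable", "soft"), 2)))) %>%
    dfm()
# each document registers a "teddybear" match
```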