Tags: r, nlp, quanteda

Find frequencies of multiple words combined as one?


I am trying to find the frequencies of several words totalled.

For example, I am using this code to find the frequencies of some words:

keyterms <- c("canadian", "american", "british")
dict <- dictionary(list(keyterms2 = c("canadian", "american", "british")))


dfm <- dfm(toks) %>%
  dfm_group(groups = "Organization") %>%
  dfm_select(pattern = keyterms)

When I run the above using keyterms and the dictionary, I get the frequencies for each word individually.

A header      canadian  american  british
Organization  10        10        10

Is there a way to write the script so that it returns the frequencies totalled up so that it looks like this:

A header      terms
Organization  30

Thank you


Solution

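  • The dictionary approach is the most elegant solution: tokens_lookup() replaces every token that matches one of your keywords with the dictionary key, so all of them are counted under a single feature.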

    Here, I've illustrated how you can do this with the built-in inaugural corpus, where the grouping variable (analogous to your "Organization") is the president's name.

    library("quanteda")
    ## Package version: 3.1
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    keyterms <- c("canadian", "american", "british")
    dict <- dictionary(list(terms = keyterms))
    
    toks <- data_corpus_inaugural %>%
      corpus_subset(Year > 2000) %>%
      tokens() %>%
      tokens_lookup(dictionary = dict)
    
    dfm(toks) %>%
      dfm_group(groups = President) %>%
      convert(to = "data.frame")
    ##   doc_id terms
    ## 1  Biden     9
    ## 2   Bush     6
    ## 3  Obama     8
    ## 4  Trump    11
    

    (You can rename the first column to "A header" if you wish.)

    Note that the usage of groups changed in quanteda 3.0: pass the document variable name unquoted (groups = President) rather than as a string (groups = "President").
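
If you already have a dfm and would rather not go back to the tokens, the same totalling can probably be done with dfm_lookup(), which applies the dictionary to the dfm's features. The following is only a minimal sketch under stated assumptions: the three toy documents, the document names, and the "Organization" values "A" and "B" are made up for illustration, and it assumes quanteda 3.0 or later (unquoted groups).

```r
library("quanteda")

# Hypothetical toy data: three short documents from two organizations
txt <- c(d1 = "Canadian and American firms",
         d2 = "British and Canadian banks",
         d3 = "An American company")
corp <- corpus(txt, docvars = data.frame(Organization = c("A", "A", "B")))

keyterms <- c("canadian", "american", "british")
dict <- dictionary(list(terms = keyterms))

# Build the dfm first (features are lowercased by default) ...
dfmat <- dfm(tokens(corp))

# ... then collapse the keyword columns into the single "terms" column
dfmat_terms <- dfm_lookup(dfmat, dictionary = dict)

# Sum the counts within each organization (unquoted docvar, quanteda >= 3.0)
dfmat_grouped <- dfm_group(dfmat_terms, groups = Organization)

result <- convert(dfmat_grouped, to = "data.frame")
names(result)[1] <- "A header"  # rename the doc_id column, as in the question
result
# terms is 4 for organization A and 1 for organization B
```

Looking up on the tokens, as in the answer above, and looking up on the dfm should give the same totals for single-word keywords like these. Multi-word dictionary entries need the tokens route, because a dfm no longer knows the word order.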