
Visualize frequency of dictionary terms using quanteda


I am analyzing the texts of several thousand newspaper articles and I'd like to construct issue dictionaries (e.g. health care, taxes, crime, etc.). Each dictionary entry is made up of several terms (e.g. doctors, nurses, hospitals, etc.).

As a diagnostic, I'd like to see which terms make up the bulk of each dictionary category.

The code illustrates where I'm at. I have worked out a way to print the top features for each dictionary entry separately, but I want one coherent dataframe at the end that I can visualize.

library(quanteda)
# set path
path_data <- system.file("extdata/", package = "readtext")

# import csv file
dat_inaug <- read.csv(paste0(path_data, "/csv/inaugCorpus.csv"))
corp_inaug <- corpus(dat_inaug, text_field = "texts")

tok <- corp_inaug %>% 
  tokens(remove_punct = TRUE) %>% 
  tokens_tolower() %>% 
  tokens_select(pattern = stopwords("en"), selection = "remove")

# I have about eight or nine dictionaries
dict <- dictionary(list(liberty = c("freedom", "free"), 
                        justice = c("justice", "law")))

# This produces a dfm of all the individual terms making up the dictionary
tok %>% 
  tokens_select(pattern = dict) %>% 
  dfm() %>% 
  topfeatures()
  
# This produces the top features making up just the 'justice' dictionary entry
tok %>% 
  tokens_select(pattern = dict["justice"]) %>% 
  dfm() %>% 
  topfeatures()
# This gets me close to what I want, but I can't figure out how to collapse it
# to visualize the most frequent terms making up each dictionary category

# (map() requires the purrr package)
library(purrr)

dict %>% 
  map(function(x) tokens_select(tok, pattern = x)) %>% 
  map(dfm) %>% 
  map(topfeatures)

Solution

  • I tidied up the code and used the built-in data_corpus_inaugural object for the example. This shows how to get a frequency data.frame by dictionary key, for the matches of your dictionary values within each key.

    library("quanteda")
    #> Package version: 3.2.4
    #> Unicode version: 14.0
    #> ICU version: 70.1
    #> Parallel computing: 10 of 10 threads used.
    #> See https://quanteda.io for tutorials and examples.
    library("quanteda.textstats")
    
    toks <- data_corpus_inaugural %>% 
      tokens(remove_punct = TRUE) %>% 
      tokens_tolower() %>% 
      tokens_remove(pattern = stopwords("en"))
    
    dict <- dictionary(list(liberty = c("freedom", "free"), 
                            justice = c("justice", "law")))
    
    dfmat_list <- lapply(names(dict), function(x) {
      tokens_select(toks, dict[x]) %>%
        dfm() %>%
        textstat_frequency() %>%
        cbind(data.frame(dict_key = x), .)
    })
    
    do.call(rbind, dfmat_list)
    #>    dict_key feature frequency rank docfreq group
    #> 1   liberty freedom       185    1      36   all
    #> 2   liberty    free       183    2      49   all
    #> 11  justice justice       142    1      47   all
    #> 21  justice     law       129    2      38   all
    

    Created on 2023-01-15 with reprex v2.0.2
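
    To visualize the combined frequency table, you can pass it to ggplot2 (an assumption here; any plotting package works) and facet by dictionary key. This continues from the `dfmat_list` built above:

    ```r
    library("ggplot2")

    # Combine the per-key frequency tables into one data frame
    freq_df <- do.call(rbind, dfmat_list)

    # Horizontal bar chart of term frequencies, one panel per dictionary key
    ggplot(freq_df, aes(x = reorder(feature, frequency), y = frequency)) +
      geom_col() +
      coord_flip() +
      facet_wrap(~ dict_key, scales = "free_y") +
      labs(x = NULL, y = "Frequency")
    ```

    `scales = "free_y"` keeps each panel limited to that key's own terms, which matters once you have eight or nine dictionaries with non-overlapping vocabularies.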