I am analyzing the texts of several thousand newspaper articles and I'd like to construct issue dictionaries (e.g. health care, taxes, crime, etc.). Each dictionary entry is made up of several terms (e.g. doctors, nurses, hospitals, etc.).
As a diagnostic, I'd like to see which terms make up the bulk of each dictionary category.
The code below illustrates where I'm at. I have worked out a way to print the top features for each dictionary entry separately, but I want one coherent dataframe at the end that I can visualize.
library(quanteda)
# set path
path_data <- system.file("extdata/", package = "readtext")
# import csv file
dat_inaug <- read.csv(paste0(path_data, "/csv/inaugCorpus.csv"))
corp_inaug <- corpus(dat_inaug, text_field = "texts")
tok <- corp_inaug %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")
# I have about eight or nine dictionaries
dict <- dictionary(list(liberty = c("freedom", "free"),
                        justice = c("justice", "law")))
# This produces a dfm of all the individual terms making up the dictionary
tok %>%
  tokens_select(pattern = dict) %>%
  dfm() %>%
  topfeatures()
# This produces the top features making up just the 'justice' dictionary entry
tok %>%
  tokens_select(pattern = dict["justice"]) %>%
  dfm() %>%
  topfeatures()
# This gets me close to what I want, but I can't figure out how to collapse it
# to visualize which terms are most frequent within each dictionary category
library(purrr)  # map() comes from purrr
dict %>%
  map(function(x) tokens_select(tok, pattern = x)) %>%
  map(dfm) %>%
  map(topfeatures)
I tidied up the code and used data_corpus_inaugural for the example. This shows how to get a frequency data.frame by dictionary key, covering the matches of your dictionary values within each key.
library("quanteda")
#> Package version: 3.2.4
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
toks <- data_corpus_inaugural %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(pattern = stopwords("en"))

dict <- dictionary(list(liberty = c("freedom", "free"),
                        justice = c("justice", "law")))

# Build one frequency table per dictionary key, tagging each row with its key
dfmat_list <- lapply(names(dict), function(x) {
  tokens_select(toks, dict[x]) %>%
    dfm() %>%
    textstat_frequency() %>%
    cbind(data.frame(dict_key = x), .)
})
do.call(rbind, dfmat_list)
#> dict_key feature frequency rank docfreq group
#> 1 liberty freedom 185 1 36 all
#> 2 liberty free 183 2 49 all
#> 11 justice justice 142 1 47 all
#> 21 justice law 129 2 38 all
Created on 2023-01-15 with reprex v2.0.2
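Since the end goal was a single data.frame you can visualize, here is a minimal ggplot2 sketch building on the combined output above. This is just one way to plot it (assuming ggplot2 is installed); freq_df is a name I'm introducing for the rbind result, and the column names follow the textstat_frequency() output shown.

library(ggplot2)

# Hypothetical name for the combined frequency table from above
freq_df <- do.call(rbind, dfmat_list)

# One horizontal bar panel per dictionary key, terms ordered by frequency
ggplot(freq_df, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ dict_key, scales = "free_y") +
  labs(x = "Term", y = "Frequency")

With only a handful of keys, facet_wrap() gives each category its own panel, and scales = "free_y" keeps terms from one key from crowding out the axis labels of the others.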