I have a dictionary with multiple subcategories and I would like to find the most frequent words and bigrams within each subcategory using R.
I am using a large dataset, but here's an example of what my data looks like:
s <- "Day after day, day after day,
We stuck, nor breath nor motion;"
library(stringi)
x <- stri_replace_all(s, "", regex = "<.*?>")
x <- stri_trim(x)
x <- stri_trans_tolower(x)
library(quanteda)
toks <- tokens(x)
toks <- tokens_wordstem(toks)
dtm <- dfm(toks, tolower = TRUE)
dtm <- dfm_remove(dtm, stopwords("english"))
dict1 <- dictionary(list(a=c("day*", "week*", "month*"),
b=c("breath*","motion*")))
dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")
tail(dict_dtm2)
This gives me the total frequencies per subcategory but not the frequency of each individual word within these subcategories. The results I am looking for would look something like this:
words(a) freq
day 4
week 0
month 0
words(b) freq
breath 1
motion 1
I would appreciate any help with that!
As far as I understand your question, I believe you are looking for the table()
function. You will need a bit of regular-expression work to clean the first sentence, but I believe you can do it. One idea would be the following:
s <- "day after day day after day We stuck nor breath nor motion"
s <- strsplit(s, "\\s+")[[1]]  # character vector of words
dict <- list(a = c("day", "week", "month"),
             b = c("breath", "motion"))
lapply(dict, function(words) {
  found <- intersect(words, s)
  table(s)[found]
})
# $a
# s
# day 
#   4 
#
# $b
# s
# breath motion 
#      1      1 
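Note that intersect() drops the words that never occur, so week and month disappear instead of showing a 0. If you want the zero counts from your desired output, one way (a sketch, using the same s and dict as above) is to tabulate against a factor whose levels are fixed to the dictionary words:

```r
# Tokenize the sample sentence into a character vector of words
s <- strsplit("day after day day after day We stuck nor breath nor motion", "\\s+")[[1]]
dict <- list(a = c("day", "week", "month"),
             b = c("breath", "motion"))

# factor(..., levels = words) forces every dictionary word to appear
# in the table, so unseen words get an explicit count of 0
lapply(dict, function(words) table(factor(s, levels = words)))
```

For this sentence, key a yields day 4, week 0, month 0, and key b yields breath 1, motion 1.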
I hope this helps. Cheers!
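If you would rather stay inside quanteda and reuse the dtm and dict1 from your question, here is a sketch along the same lines: select the features matching each dictionary key and sum their columns. (As with intersect(), stems absent from the dfm simply won't appear, so there are no explicit zeros here.)

```r
library(quanteda)

# Rebuild the dfm roughly as in the question: lowercase, tokenize, stem,
# drop English stopwords
toks <- tokens_wordstem(tokens(tolower(
  "Day after day, day after day, We stuck, nor breath nor motion;")))
dtm <- dfm_remove(dfm(toks), stopwords("english"))

dict1 <- dictionary(list(a = c("day*", "week*", "month*"),
                         b = c("breath*", "motion*")))

# For each dictionary key, keep only its matching features and
# sum each feature's column to get per-word counts within the key
lapply(names(dict1), function(key) colSums(dfm_select(dtm, pattern = dict1[key])))
```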