Tags: r, text-mining, text-analysis, quanteda

Get term frequencies within categories in R dictionary


I have a dictionary with multiple subcategories and I would like to find the most frequent words and bigrams within each subcategory using R.

I am using a large dataset, but here's an example of what my data looks like:

s <-  "Day after day, day after day,
We stuck, nor breath nor motion;"

library(stringi)
x <- stri_replace_all(s, "", regex="<.*?>") 
x <- stri_trim(x)
x <- stri_trans_tolower(x) 

library(quanteda)
toks <- tokens(x) 
toks <- tokens_wordstem(toks) 

dtm <- dfm(toks, 
       tolower=TRUE, stem=TRUE,
       remove=stopwords("english"))

dict1 <- dictionary(list(a=c("day*", "week*", "month*"),
                    b=c("breath*","motion*")))

dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")                                 
tail(dict_dtm2)    

This gives me the total frequencies per subcategory but not the frequency of each individual word within these subcategories. The results I am looking for would look something like this:

words(a)   freq
day         4
week        0
month       0

words(b)   freq
breath     1
motion     1 

I would appreciate any help with that!


Solution

  • As far as I understand your question, I believe you are looking for the table() command. You need to do a little work with regular expressions to clean up the first sentence, but I believe you can do it. One idea is the following:

    s <- "day after day day after day We stuck nor breath nor motion"
    s <- unlist(strsplit(s, "\\s+"))
    
    dict <- list(a = c("day", "week", "month"),
                 b = c("breath", "motion"))
    lapply(dict, function(x) {
        words_in_dict <- intersect(x, s)
        table(s)[words_in_dict]
    })
    
    
    # $a
    # day 
    #   4 
    # 
    # $b
    # s
    # breath motion 
    #      1      1 
    

    I hope it helps. Cheers!
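One caveat: the intersect()/table() approach drops dictionary words that never occur, while the desired output in the question lists week and month with a frequency of 0. A small base-R variant (a sketch, assuming the same sentence and dictionary as above) tabulates against the dictionary words as factor levels, so zero-frequency words are kept:

```r
s <- "day after day day after day We stuck nor breath nor motion"
# lowercase first so "Day"/"day" are counted together
words <- unlist(strsplit(tolower(s), "\\s+"))

dict <- list(a = c("day", "week", "month"),
             b = c("breath", "motion"))

# factor(..., levels = x) keeps every dictionary word as a level,
# so table() reports a 0 count for words that never appear
# (for key a: day = 4, week = 0, month = 0; for key b: breath = 1, motion = 1)
lapply(dict, function(x) table(factor(words, levels = x)))
```

Passing the dictionary words via levels= is what makes table() report unmatched words as 0 instead of silently omitting them.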