r, nlp, data-science, quanteda

In R, how can I count specific words in a corpus?


I need to count the frequency of particular words, and there are a lot of them. I know how to do this by putting all the words in one group (see below), but I would like to get a count for each individual word.

This is what I have at the moment:

library(quanteda)
# function to count pattern matches after splitting each string
strcount <- function(x, pattern, split) {
  unlist(lapply(strsplit(x, split), function(z) na.omit(length(grep(pattern, z)))))
}
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
df <- data.frame(txt)
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
corp <- corpus(df, text_field = "txt")
# count terms and save output to "overview"
overview <- dfm(corp, dictionary = mydict)
overview <- convert(overview, to = "data.frame")

As you can see, the counts for "clouds" and "storms" are in the "all_terms" category in the resulting data.frame. Is there an easy way to get the count for all terms in "mydict" in individual columns, without writing the code for each individual term?

E.g.
clouds, storms
1, 1

Rather than 
all_terms
2

Solution

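  • You want to use the dictionary values as a pattern in tokens_select(), rather than in a lookup function, which is what dfm(x, dictionary = ...) does. A lookup replaces each matched token with its dictionary key, which is why both words end up counted under all_terms. Here's how: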

    library("quanteda")
    ## Package version: 2.1.2
    
    txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
    
    mydict <- dictionary(list(all_terms = c("clouds", "storms")))
    

    This creates a dfm in which each column is a term rather than the dictionary key:

    dfmat <- tokens(txt) %>%
      tokens_select(mydict) %>%
      dfm()
    
    dfmat
    ## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
    ##        features
    ## docs    clouds storms
    ##   text1      1      1
    

    You can turn this into a data.frame of counts in two ways:

    convert(dfmat, to = "data.frame")
    ##   doc_id clouds storms
    ## 1  text1      1      1
    
    # in quanteda >= 3.0, textstat_frequency() is in the quanteda.textstats package
    textstat_frequency(dfmat)
    ##   feature frequency rank docfreq group
    ## 1  clouds         1    1       1   all
    ## 2  storms         1    1       1   all
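
If you would rather stay with the lookup approach, one possible workaround is to give every term its own dictionary key, so the lookup has nothing to collapse. This is only a sketch: `terms`, `per_term_dict` and `dfmat_lookup` are names introduced here for illustration, and the text is shortened to one sentence of the original.

```r
library("quanteda")

# terms to count; each one becomes both the key and the value of an entry
terms <- c("clouds", "storms")
per_term_dict <- dictionary(setNames(as.list(terms), terms))

# one sentence from the example text, to keep the sketch short
toks <- tokens("Yet, every so often the oath is taken amidst gathering clouds and raging storms.")

# the lookup now maps each match to a key named after the term itself
dfmat_lookup <- dfm(tokens_lookup(toks, per_term_dict))
convert(dfmat_lookup, to = "data.frame")
```

Because the keys are generated from the vector, a long list of words needs no per-term code. The tokens_select() route is still the shorter one when you do not need dictionary keys at all.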
    

    Although a dictionary is a valid input for a pattern (see ?pattern), you could also have passed the character vector of values directly to tokens_select():

    # no need for dictionary
    tokens(txt) %>%
      tokens_select(c("clouds", "storms")) %>%
      dfm()
    ## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
    ##        features
    ## docs    clouds storms
    ##   text1      1      1
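
As a rough cross-check that does not depend on quanteda, the same per-term counts can be approximated in base R. This is a minimal sketch, assuming that lowercasing and splitting on non-letter characters is an acceptable tokenisation for your text. The names `terms`, `toks` and `counts` are illustrative, and the text is shortened to one sentence of the original.

```r
txt <- "Yet, every so often the oath is taken amidst gathering clouds and raging storms."
terms <- c("clouds", "storms")

# crude tokenisation: lowercase, then split on runs of non-letters
toks <- strsplit(tolower(txt), "[^a-z]+")[[1]]

# count exact matches for each term; vapply keeps the term names
counts <- vapply(terms, function(w) sum(toks == w), integer(1))

# one row, one column per term
as.data.frame(as.list(counts))
```

Unlike tokens_select(), which treats patterns as globs by default, this matches whole lowercase tokens exactly and knows nothing about multi-word expressions, so treat it as a sanity check, not a replacement.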