r nlp quanteda

In R, combining individual word count and dictionary word count


I need to count words in a document. In some cases, I need to count specific words (e.g. "fresh"), in other cases I need to get the total count of a set of words ("philadelphia","aunt").

I know how to do this in two separate steps (see code below), but how can I do both at the same time?

The code below counts specific words.

library("quanteda")
txt <- "In west Philadelphia born and raised On the playground was where I spent most of my days Chillin' out maxin' relaxin' all cool And all shootin some b-ball outside of the school When a couple of guys who were up to no good Started making trouble in my neighborhood I got in one little fight and my mom got scared."
tokens(txt) %>% tokens_select(c("trouble", "fight")) %>% dfm()

Output is:

trouble, fight
1, 1

The code below counts dictionary words and writes the total count to one column.

mydict <- dictionary(list(all_terms = c("chillin", "relaxin", "shootin")))
count <- dfm(txt, dictionary = mydict)

Output is:

all_terms
3

How can I combine the two?

I would like something like this: (code is hypothetical and does NOT work)

tokens(txt) %>% tokens_select(c("trouble", "fight"), mydict) %>% dfm()

or

tokens(txt) %>% tokens_select(c("trouble", "fight"), all_terms=c("chillin","relaxin","shootin")) %>% dfm()

Desired output:

trouble, fight, all_terms
1, 1, 3

Solution

  • There are a couple of ways to do this; the following is probably the simplest. Define a dictionary in which each specific word gets its own key (the key equal to the word itself), and each set of words gets a group key, in your example "all_terms".

    library("quanteda")
    ## Package version: 2.1.2
    
    txt <- "In west Philadelphia born and raised On the playground was where I spent most of my days Chillin' out maxin' relaxin' all cool And all shootin some b-ball outside of the school When a couple of guys who were up to no good Started making trouble in my neighborhood I got in one little fight and my mom got scared."
    
    dict <- dictionary(list(
      trouble = "trouble",
      fight = "fight",
      all_terms = c("chillin", "relaxin", "shootin")
    ))
    

    Now when you compile the dfm, you will get what you are after.

    dfmat <- dfm(txt, dictionary = dict)
    dfmat
    ## Document-feature matrix of: 1 document, 3 features (0.0% sparse).
    ##        features
    ## docs    trouble fight all_terms
    ##   text1       1     1         3
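The call above was run under quanteda 2.1.2. As far as I know, newer releases (3.0 and later) deprecate passing a character vector and the dictionary argument to dfm(), so the same idea may need to go through tokens() and tokens_lookup() there. A minimal sketch, assuming a shortened sample text (txt_short is made up here, with the apostrophes dropped to sidestep tokenizer differences):

```r
library("quanteda")

# Shortened, apostrophe-free sample text (an assumption for this sketch only)
txt_short <- "Chillin out maxin relaxin all cool and all shootin some b-ball making trouble in one little fight"

# Same dictionary as above: one key per single word, one group key for the set
dict <- dictionary(list(
  trouble = "trouble",
  fight = "fight",
  all_terms = c("chillin", "relaxin", "shootin")
))

# Tokenize first, replace matched tokens with their dictionary keys
# (exclusive = TRUE is the default, so unmatched tokens are dropped),
# then build the dfm from the looked-up tokens
toks <- tokens(txt_short)
dfmat <- dfm(tokens_lookup(toks, dictionary = dict))

# Named numeric vector of counts, one entry per dictionary key
counts <- structure(as.vector(dfmat), names = featnames(dfmat))
counts
```

Looking counts up by name (for example counts[["all_terms"]]) avoids relying on the column order of the dfm.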
    

    To coerce this to a simpler object, including the output you listed, you can do this:

    # as a named numeric vector
    structure(as.vector(dfmat), names = featnames(dfmat))
    ##   trouble     fight all_terms 
    ##         1         1         3
    
    # per your output
    cat(
      paste(featnames(dfmat), collapse = ", "), "\n",
      paste(as.vector(dfmat), collapse = ", ")
    )
    ## trouble, fight, all_terms 
    ##  1, 1, 3
    

    Note that it's not a good idea (as in the other answer) to access the object internals directly. Use extractor functions such as featnames() instead.

    Added:

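    An alternative that avoids creating a dictionary entry for every single word: with exclusive = FALSE, tokens_lookup() replaces matched tokens with their dictionary key and leaves all other tokens unchanged, so tokens_keep() can then select the key together with the single words.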

    dict <- dictionary(list(all_terms = c("chillin", "relaxin", "shootin")))
    single_words <- c("trouble", "fight")
    
    tokens(txt) %>%
      tokens_lookup(dictionary = dict, exclusive = FALSE) %>%
      tokens_keep(pattern = c(names(dict), single_words)) %>%
      dfm()
    ## Document-feature matrix of: 1 document, 3 features (0.0% sparse).
    ##        features
    ## docs    all_terms trouble fight
    ##   text1         3       1     1
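If there are many single words, the named list for the first approach can also be built programmatically rather than typed out. A rough sketch, where make_count_dict() is a hypothetical helper (not part of quanteda) and txt_short is a made-up, shortened sample text; it uses dfm_lookup() on a dfm built from tokens:

```r
library("quanteda")

# Hypothetical helper: give each single word its own key (key == word)
# and append the user-supplied group keys
make_count_dict <- function(single_words, groups) {
  singles <- as.list(setNames(single_words, single_words))
  dictionary(c(singles, groups))
}

# Shortened, apostrophe-free sample text (an assumption for this sketch only)
txt_short <- "Chillin out maxin relaxin all cool and all shootin some b-ball making trouble in one little fight"

dict <- make_count_dict(
  single_words = c("trouble", "fight"),
  groups = list(all_terms = c("chillin", "relaxin", "shootin"))
)

# Build the dfm from tokens, then collapse features into dictionary keys
dfmat <- dfm_lookup(dfm(tokens(txt_short)), dictionary = dict)

# Named numeric vector of counts, one entry per dictionary key
counts <- structure(as.vector(dfmat), names = featnames(dfmat))
counts
```

This keeps the single words and the grouped sets in one dictionary object, so the same lookup can be reused across documents.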