Tags: r, regex, text-mining, quanteda

Quanteda: calculating token frequencies in a dfm, including a custom list of phrases


I have been wondering whether textstat_frequency from the powerful quanteda library in R can also count a custom list of phrases (multi-word "words") alongside the individual tokens. For instance, I have the following data set:

library(quanteda)
library(quanteda.textstats)

df_sample<-c("Word Record",
             "be able to count by word",
             "But also include some phrases such as",
             "World Record Super Bass Mr. President Mr. President")

When I calculate the textstat_frequency of the df_sample I get something like this:

> tokens<-corpus(df_sample) %>% tokens(remove_punct = TRUE)
> dfm<-dfm(tokens)
> 
> quanteda.textstats::textstat_frequency(dfm)
     feature frequency rank docfreq group
1       word         2    1       2   all
2     record         2    1       2   all
3         mr         2    1       1   all
4  president         2    1       1   all
5         be         1    5       1   all
6       able         1    5       1   all
7         to         1    5       1   all
8      count         1    5       1   all
9         by         1    5       1   all
10       but         1    5       1   all
11      also         1    5       1   all
12   include         1    5       1   all
13      some         1    5       1   all
14   phrases         1    5       1   all
15      such         1    5       1   all
16        as         1    5       1   all
17     world         1    5       1   all
18     super         1    5       1   all
19      bass         1    5       1   all
> 

This is correct, but I also want to change my code so that the output takes into account and prints the phrases "Mr. President", "World Record", and "Super Bass":

key_lookups<-c("Mr. President", "World Record", "Super Bass" )

How can I use quanteda functions so that my output contains, along with the previous counts, the frequency of these phrases? For example:

"Mr. President" 2 "World Record" 2 "Super Bass" 1


Solution

  • First, a warning about your example code: do not create objects that have the same name as functions (like tokens and dfm). This will (eventually) lead to errors that are difficult to debug.
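
    A minimal sketch of why this matters, using base R only. The reused name mean is a deliberately bad choice that stands in for objects like tokens or dfm; nothing here is specific to quanteda:

    ```r
    # Assign data to a name that already belongs to a function (base::mean).
    mean <- c(1, 2, 3)

    # In call position R skips non-function objects, so this still works ...
    avg <- mean(mean)

    # ... but everywhere else the name now refers to the data, not the function.
    shadowed <- is.function(mean)

    # Removing the variable makes the name point at the function again.
    rm(mean)
    restored <- is.function(mean)
    ```

    Distinct names such as my_tokens and my_dfm avoid this ambiguity.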

    There are probably a few ways of doing this. I created a "normal" tokens object and a bigram (ngrams) tokens object, and turned both into dfms. From the ngrams dfm I kept only the phrases you wanted. I then combined the two dfms, after which you can use textstat_frequency as normal.

    Note: you can't combine tokens objects like you can combine dfm objects.

    library(quanteda)
    library(quanteda.textstats)
    
    df_sample<-c("Word Record",
                 "be able to count by word",
                 "But also include some phrases such as",
                 "World Record Super Bass Mr. President Mr. President")
    
    
    
    my_tokens <- corpus(df_sample) %>% tokens(remove_punct = TRUE)
    my_dfm <- dfm(my_tokens)
    
    # No periods in the lookups: tokens(remove_punct = TRUE) already removed them
    key_lookups <- c("Mr President", "World Record", "Super Bass")
    
    
    my_tokens_ngram <- tokens_ngrams(my_tokens, n = 2, concatenator = " ")
    
    my_dfm_ngrams <- dfm(my_tokens_ngram)
    
    # Only keep the lookups
    my_dfm_ngrams <- dfm_keep(my_dfm_ngrams, key_lookups)
    
    # Combine both dfms
    my_dfms <- rbind(my_dfm, my_dfm_ngrams)
    
    # if necessary uncomment next part
    # my_dfms <- dfm_compress(my_dfms) 
    

    Outcome:

    head(textstat_frequency(my_dfms), 5)
           feature frequency rank docfreq group
    1         word         2    1       2   all
    2       record         2    1       2   all
    3           mr         2    1       1   all
    4    president         2    1       1   all
    5 mr president         2    1       1   all
    
    tail(textstat_frequency(my_dfms), 5)
            feature frequency rank docfreq group
    18        world         1    6       1   all
    19        super         1    6       1   all
    20         bass         1    6       1   all
    21 world record         1    6       1   all
    22   super bass         1    6       1   all
    

    Note that using rbind on dfms creates new document names like "text1.1". If you want these merged back into the original documents, call dfm_compress(my_dfms) first and then call textstat_frequency.
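
    As a possible alternative that avoids the extra document rows altogether, the two dfms can be joined by columns instead of rows. This is only a sketch, assuming both dfms contain the same documents in the same order (true here, since both come from the same tokens object) and that their feature names do not overlap. The names my_dfm_wide and phrase_freq are illustrative. The last step filters the frequency table down to the phrases only:

    ```r
    library(quanteda)
    library(quanteda.textstats)

    df_sample <- c("Word Record",
                   "be able to count by word",
                   "But also include some phrases such as",
                   "World Record Super Bass Mr. President Mr. President")

    # Lookups without periods, because remove_punct = TRUE strips them
    key_lookups <- c("Mr President", "World Record", "Super Bass")

    my_tokens <- tokens(corpus(df_sample), remove_punct = TRUE)

    # Unigram dfm, plus a bigram dfm reduced to the phrases of interest
    my_dfm <- dfm(my_tokens)
    my_dfm_ngrams <- dfm_keep(
      dfm(tokens_ngrams(my_tokens, n = 2, concatenator = " ")),
      key_lookups
    )

    # Column-bind: same documents, features from both dfms side by side
    my_dfm_wide <- cbind(my_dfm, my_dfm_ngrams)

    freq <- textstat_frequency(my_dfm_wide)

    # Keep only the phrase rows (dfm() lowercases features by default)
    phrase_freq <- freq[freq$feature %in% tolower(key_lookups),
                        c("feature", "frequency")]
    print(phrase_freq)
    ```

    With the sample data this should report "world record" once rather than twice, because the first document reads "Word Record".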