
R quanteda: filtering, counting and grouping features from a customized dictionary


I have the following data set:

library(quanteda)
library(quanteda.textstats)

df_test<-c("I find water to be so healthy and refreshing",
           "Nothing like a freshly made burguer to make me feel good",
           "I dislike sugar in the morning it tastes horrible",
           "A nice burguer is always crispy and spicy",
           "It is beyond me to dare to drink soda it's just gross too much sugar",
           "Yes I will have a hot burguer anytime is so cheap and tasty")

I want to build a customized dictionary so that I can classify words/tokens into two categories, "Negative" and "Positive". After that, I want to filter by the most frequent words/tokens and plot the positive and negative words associated with them.

This is my dictionary:

dict_custom <- dictionary(list(positive = c("healthy", "refreshing", "good", "crispy", 
                                      "spicy", "cheap", "tasty"),
                               negative=c("horrible","gross")))

What are some of the most frequent tokens?

> tok_df<-corpus(df_test) %>% tokens(remove_punct=TRUE) %>% tokens_remove(stopwords("en"))
> 
> tok_df %>% dfm() %>% 
+   textstat_frequency(5)  
  feature frequency rank docfreq group
1 burguer         3    1       3   all
2   sugar         2    2       2   all
3    find         1    3       1   all
4   water         1    3       1   all
5 healthy         1    3       1   all

I want to choose "burguer" (spelled as in my data) and get all the positive and negative words associated with it (after applying my dictionary), count the number of times they appear, and perhaps also create a word cloud.

I'm using this code:

> tokens_lookup(tok_df,dictionary = dict_custom) %>% 
+   dfm()
Document-feature matrix of: 6 documents, 2 features (50.00% sparse) and 0 docvars.
       features
docs    positive negative
  text1        2        0
  text2        1        0
  text3        0        1
  text4        2        0
  text5        0        1
  text6        2        0

But instead of the words themselves, I get the counts of positive and negative tokens per document.

My desired output is a matrix/dfm-like object, filtered by "burguer", that contains all of the negative and positive tokens (crispy, healthy, gross, etc.) rather than the per-document counts of negative and positive tokens (which I do not want).

By the way, what if, instead of labelling words as negative or positive, I want to assign each one a numeric value, say gross = -5 and crispy = 5? How can I join my tokens with this kind of dictionary so that I can afterwards summarize the numeric output?


Solution

  • The best way to do this is to use the ability of tokens_select() to filter on dictionaries. By indexing each key separately (below, using lapply()), you can create a list of dfm objects whose features are the value matches for each key.

    library("quanteda")
    #> Package version: 3.2.4
    #> Unicode version: 14.0
    #> ICU version: 70.1
    #> Parallel computing: 10 of 10 threads used.
    #> See https://quanteda.io for tutorials and examples.
    library("quanteda.textstats")
    
    df_test <- c("I find water to be so healthy and refreshing",
                 "Nothing like a freshly made burguer to make me feel good",
                 "I dislike sugar in the morning it tastes horrible",
                 "A nice burguer is always crispy and spicy",
                 "It is beyond me to dare to drink soda it's just gross too much sugar",
                 "Yes I will have a hot burguer anytime is so cheap and tasty")
    
    dict_custom <- dictionary(list(positive = c("healthy", "refreshing", "good", "crispy", 
                                                "spicy", "cheap", "tasty"),
                                   negative = c("horrible","gross")))
    
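    # Tokenize the raw texts; no stopword removal is needed here because
    # tokens_select() keeps only the dictionary values anyway
    toks <- tokens(df_test)
    
    # For each dictionary key, keep only the tokens matching that key's
    # values and build a dfm from what remains
    dfm_list <- lapply(
        names(dict_custom), 
        function(x) {
            tokens_select(toks, dict_custom[x]) |>
                dfm()
        }
    )
    # Name the list elements after the dictionary keys
    names(dfm_list) <- names(dict_custom)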
    

    Now you have a list of dfm objects, named by your dictionary keys, for which you can then get frequencies, word clouds, etc.

    dfm_list
    #> $positive
    #> Document-feature matrix of: 6 documents, 7 features (83.33% sparse) and 0 docvars.
    #>        features
    #> docs    healthy refreshing good crispy spicy cheap tasty
    #>   text1       1          1    0      0     0     0     0
    #>   text2       0          0    1      0     0     0     0
    #>   text3       0          0    0      0     0     0     0
    #>   text4       0          0    0      1     1     0     0
    #>   text5       0          0    0      0     0     0     0
    #>   text6       0          0    0      0     0     1     1
    #> 
    #> $negative
    #> Document-feature matrix of: 6 documents, 2 features (83.33% sparse) and 0 docvars.
    #>        features
    #> docs    horrible gross
    #>   text1        0     0
    #>   text2        0     0
    #>   text3        1     0
    #>   text4        0     0
    #>   text5        0     1
    #>   text6        0     0
    

    Frequencies:

    lapply(dfm_list, textstat_frequency)
    #> $positive
    #>      feature frequency rank docfreq group
    #> 1    healthy         1    1       1   all
    #> 2 refreshing         1    1       1   all
    #> 3       good         1    1       1   all
    #> 4     crispy         1    1       1   all
    #> 5      spicy         1    1       1   all
    #> 6      cheap         1    1       1   all
    #> 7      tasty         1    1       1   all
    #> 
    #> $negative
    #>    feature frequency rank docfreq group
    #> 1 horrible         1    1       1   all
    #> 2    gross         1    1       1   all
    

    Created on 2023-01-04 with reprex v2.0.2
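
The second part of the question (numeric values such as gross = -5 and crispy = 5) is not covered above. A hedged sketch of one possible approach is to store the scores in a named numeric vector, align a dfm to those names with dfm_match(), and take the weighted sum per document. Only the scores for "gross" and "crispy" come from the question; the other two values, and the names `scores`, `dfmat_scored` and `doc_score`, are hypothetical.

```r
library("quanteda")

df_test <- c("I find water to be so healthy and refreshing",
             "Nothing like a freshly made burguer to make me feel good",
             "I dislike sugar in the morning it tastes horrible",
             "A nice burguer is always crispy and spicy",
             "It is beyond me to dare to drink soda it's just gross too much sugar",
             "Yes I will have a hot burguer anytime is so cheap and tasty")

# gross and crispy are from the question; horrible and tasty are made-up values
scores <- c(gross = -5, crispy = 5, horrible = -3, tasty = 2)

dfmat <- dfm(tokens(df_test, remove_punct = TRUE))

# Keep exactly the scored features, in the same order as `scores`
dfmat_scored <- dfm_match(dfmat, features = names(scores))

# Weighted sum per document: counts (documents x features) times scores
doc_score <- as.vector(as.matrix(dfmat_scored) %*% scores)
names(doc_score) <- docnames(dfmat)

doc_score
```

Words without a score contribute nothing to the total, so a document with no scored words ends up at 0. The per-document totals can then be summarized further, for example with `mean(doc_score)` or by subsetting to the documents of interest first.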