Tags: r, text-mining, quanteda

Compute chi-square value between ngrams and documents with Quanteda


I am using the quanteda R package to extract ngrams (here, 1-grams and 2-grams) from the text in Data_clean$Review, but I am looking for a way in R to compute the chi-square value between each document and the extracted ngrams.

Here is the R code that I used to clean up the text (reviews) and generate the n-grams.

Any ideas, please?

Thank you.

library(quanteda)

# delete rows with empty values in the Note or Review columns
Data_clean <- Data[Data$Note != "" & Data$Review != "", ]

# add a sequential document id
Data_clean$id <- seq.int(nrow(Data_clean))

train.index <- 1:50000
test.index  <- 50001:nrow(Data_clean)

# clean up: lowercase the reviews and replace punctuation/digits with spaces
Data_clean$Review.clean <- tolower(gsub('[[:punct:]0-9]', ' ', Data_clean$Review))

train <- Data_clean[train.index, ]
test  <- Data_clean[test.index, ]

# generate 1-gram and 2-gram tokens, then a document-feature matrix
temp.tf <- Data_clean$Review.clean %>%
    tokens() %>%
    tokens_ngrams(n = 1:2) %>%
    dfm()

Solution

  • You would not use ngrams for this, but rather a function called textstat_collocations().

    It's a bit hard to follow your exact example since none of those objects are explained or supplied, but let's try it with some of quanteda's built-in data. I'll get the texts from the inaugural corpus and apply some filters similar to what you have above.

    So to score bigrams for chi^2, you would use:

    # create the corpus, subset on some conditions (could be Note != "" for instance)
    corp_example <- data_corpus_inaugural
    corp_example <- corpus_subset(corp_example, Year > 1960)
    
    # this will remove punctuation and numbers
    toks_example <- tokens(corp_example, remove_punct = TRUE, remove_numbers = TRUE)
    
    # find and score chi^2 bigrams
    coll2 <- textstat_collocations(toks_example, method = "chi2", max_size = 2)
    head(coll2, 10)
    #             collocation count       X2
    # 1       reverend clergy     2 28614.00
    # 2       Majority Leader     2 28614.00
    # 3       Information Age     2 28614.00
    # 4      Founding Fathers     3 28614.00
    # 5  distinguished guests     3 28614.00
    # 6       Social Security     3 28614.00
    # 7         Chief Justice     9 23409.82
    # 8          middle class     4 22890.40
    # 9       Abraham Lincoln     2 19075.33
    # 10       society's ills     2 19075.33
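
    One optional follow-up (a sketch, not part of the original answer): a bigram's chi^2 statistic here has one degree of freedom, so you could keep only the collocations significant at p < .05 by filtering on the corresponding critical value, roughly 3.84:

    # illustrative filter: keep collocations whose chi^2 exceeds the
    # df = 1 critical value at alpha = 0.05 (~3.84)
    coll2_sig <- coll2[coll2$X2 > 3.84, ]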
    

    Added:

    # needs to be a list of the collocations as separate character elements
    coll2a <- sapply(coll2$collocation, strsplit, " ", USE.NAMES = FALSE)
    
    # compound the tokens using top 100 collocations
    toks_example_comp <- tokens_compound(toks_example, coll2a[1:100])
    toks_example_comp[[1]][1:20]
    # [1] "Vice_President"  "Johnson"         "Mr_Speaker"      "Mr_Chief"        "Chief_Justice"  
    # [6] "President"       "Eisenhower"      "Vice_President"  "Nixon"           "President"      
    # [11] "Truman"          "reverend_clergy" "fellow_citizens" "we"              "observe"        
    # [16] "today"           "not"             "a"               "victory"         "of"
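
    If what you ultimately need is the chi^2 association between each ngram and a given document, as the question framed it, rather than collocation scores, a minimal sketch using textstat_keyness() would look like the following (dfmat and key1 are just illustrative names):

    # build a dfm of unigrams and bigrams, then score each feature's
    # chi^2 association with document 1 against all other documents
    dfmat <- dfm(tokens_ngrams(toks_example, n = 1:2))
    key1 <- textstat_keyness(dfmat, target = 1, measure = "chi2")
    head(key1)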