
How is chi-squared association / keyness calculated in quanteda?


I am trying to understand the chi-squared calculation behind the keyness (association) scores that quanteda reports for features in a target group versus a reference group.

library(quanteda)    
pres_corpus <- corpus_subset(data_corpus_inaugural, President %in% c("Obama", "Trump"))

# Remove Punctuation and Numbers
tokensAll <- tokens(pres_corpus, remove_punct = TRUE, remove_numbers= TRUE)

# Removing stopwords before constructing bigrams
tokensNoStopwords <- tokens_remove(tokensAll, stopwords("english"))

# Bigram
tokensNgramsNoStopwords <- tokens_ngrams(tokensNoStopwords,  n=2, concatenator = "_")
dtm = dfm(tokensNgramsNoStopwords, tolower = TRUE, groups = "President")

# Calculate keyness with Obama as the target group
(result_keyness <- textstat_keyness(dtm, target = "Obama"))[1]

My manual reproduction of the textstat_keyness() calculation is shown below:

# Number of words
sums <- rowSums(dtm)

# frequency of target
a = as.numeric(dtm[1,1])

# frequency of reference
b = as.numeric(dtm[2,1])

# total of all target words minus freq. of target
c = sums[1] - a

# total of all reference words minus freq. of reference
d = sums[2] - b

N = (a+b+c+d)
E = (a+b)*(a+c) / N
(N * abs(a*d - b*c)^2) / ((a+b)*(c+d)*(a+c)*(b+d)) * ifelse(a > E, 1, -1)
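
In formula form, the code above computes the following for the 2x2 table with cells a, b, c and d. The symbols are the same as the variables in the code, and the statistic is given a positive sign when a > E and a negative sign otherwise:

```latex
\[
\chi^2 = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)},
\qquad N = a + b + c + d,
\qquad E = \frac{(a+b)(a+c)}{N}
\]
```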

This matches the score returned by textstat_keyness(). However, it does not match the result of chisq.test():

(tt = as.table(rbind(c(a, b), c(c, d))))
suppressWarnings(chi <- stats::chisq.test(tt))
(t_exp <- chi$expected[1,1])
# Compare only the target cell with its expected count to get a single sign
(chi2 = unname(chi$statistic) * ifelse(tt[1, 1] > t_exp, 1, -1))

Solution

  • The difference comes from Yates's continuity correction for the 2x2 chi-squared test. chisq.test() applies the correction by default (correct = TRUE), whereas your manual computation does not apply it.

    So:

    textstat_keyness(dtm, target = "Obama")[1]
    ##           feature     chi2        p n_target n_reference
    ## 1 fellow_citizens 0.647129 0.421141        2           0
    

    And without the correction:

    chisq.test(tt, correct = FALSE)
    
    ##  Pearson's Chi-squared test
    ## 
    ## data:  tt
    ## X-squared = 0.64713, df = 1, p-value = 0.4211
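
To make the effect of the correction concrete, here is a minimal base-R sketch that computes the 2x2 statistic by hand both ways and compares each with chisq.test(). The cell counts are hypothetical values chosen for illustration, not counts from the inaugural corpus, and the variable names chi2_plain and chi2_yates are introduced here only for the example.

```r
# Hypothetical 2x2 counts (illustrative only, not from the corpus above)
a <- 12    # feature frequency in the target group
b <- 5     # feature frequency in the reference group
c <- 988   # all other features in the target group
d <- 1495  # all other features in the reference group

N <- a + b + c + d
margins <- (a + b) * (c + d) * (a + c) * (b + d)

# Pearson chi-squared without continuity correction
chi2_plain <- N * (a * d - b * c)^2 / margins

# Yates's correction: shrink |ad - bc| by N/2, but never below zero
chi2_yates <- N * max(0, abs(a * d - b * c) - N / 2)^2 / margins

# Same table layout as in the question: rows are (a, b) and (c, d)
tt <- as.table(rbind(c(a, b), c(c, d)))

# Each manual value should agree with the corresponding chisq.test() call
all.equal(chi2_plain, unname(chisq.test(tt, correct = FALSE)$statistic))
all.equal(chi2_yates, unname(chisq.test(tt, correct = TRUE)$statistic))
```

Because the correction reduces |ad - bc| before squaring, the corrected statistic can never be larger than the uncorrected one, which is why the default chisq.test() value can differ from the manual calculation.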