I am trying to understand the chi-squared calculation behind the association (keyness) of keywords between a target group and a reference group.
library(quanteda)
pres_corpus <- corpus_subset(data_corpus_inaugural, President %in% c("Obama", "Trump"))
# Remove Punctuation and Numbers
tokensAll <- tokens(pres_corpus, remove_punct = TRUE, remove_numbers= TRUE)
# Removing stopwords before constructing bigrams
tokensNoStopwords <- tokens_remove(tokensAll, stopwords("english"))
# Bigram
tokensNgramsNoStopwords <- tokens_ngrams(tokensNoStopwords, n=2, concatenator = "_")
dtm = dfm(tokensNgramsNoStopwords, tolower = TRUE, groups = "President")
# Calculate keyness with Obama as the target group (Trump is the reference)
(result_keyness <- textstat_keyness(dtm, target = "Obama"))[1]
My manual calculation of what textstat_keyness() returns is shown below:
# Total number of tokens (bigrams) in each group
sums <- rowSums(dtm)
# frequency of the first feature in the target group
a = as.numeric(dtm[1,1])
# frequency of the same feature in the reference group
b = as.numeric(dtm[2,1])
# all other tokens in the target group (group total minus a)
c = sums[1] - a
# all other tokens in the reference group (group total minus b)
d = sums[2] - b
N = (a+b+c+d)
E = (a+b)*(a+c) / N
(N * abs(a*d - b*c)^2) / ((a+b)*(c+d)*(a+c)*(b+d)) * ifelse(a > E, 1, -1)
This matches the score returned by textstat_keyness(). However, it does not match if I use chisq.test():
(tt = as.table(rbind(c(a, b), c(c, d))))
suppressWarnings(chi <- stats::chisq.test(tt))
(t_exp <- chi$expected[1,1])
(chi2 = unname(chi$statistic) * ifelse(tt[1, 1] > t_exp, 1, -1))
The difference lies in Yates's continuity correction for the 2x2 chi-squared test. chisq.test() applies the correction by default, whereas your manual computation does not apply it.
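For reference, here are the two statistics written with the counts a, b, c, d and N = a + b + c + d from the question. This is the usual textbook form of the correction for a 2x2 table, given here as a reminder and not quoted from the source of either function:

```latex
\chi^2 = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)},
\qquad
\chi^2_{\text{Yates}} = \frac{N\,\bigl(\max\{0,\ |ad - bc| - N/2\}\bigr)^2}{(a+b)(c+d)(a+c)(b+d)}
```

The only change is in the numerator, so the corrected value is never larger than the uncorrected one.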
So:
textstat_keyness(dtm, target = "Obama")[1]
##           feature     chi2        p n_target n_reference
## 1 fellow_citizens 0.647129 0.421141        2           0
And without the correction:
chisq.test(tt, correct = FALSE)
## Pearson's Chi-squared test
##
## data: tt
## X-squared = 0.64713, df = 1, p-value = 0.4211
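If it helps to see both variants side by side, below is a minimal base R sketch that computes the statistic by hand with and without the correction and compares each to chisq.test(). The counts are hypothetical (they are not taken from the inaugural corpus), and the names chi2_plain and chi2_yates are only illustrative:

```r
# Hypothetical 2x2 counts, for illustration only (not from the corpus above)
a <- 12    # feature count in the target group
b <- 5     # feature count in the reference group
c <- 988   # all other tokens in the target group
d <- 1495  # all other tokens in the reference group
N <- a + b + c + d

tt <- rbind(c(a, b), c(c, d))
denom <- (a + b) * (c + d) * (a + c) * (b + d)

# Pearson chi-squared without continuity correction
chi2_plain <- N * (a * d - b * c)^2 / denom

# Yates's correction: shrink |ad - bc| by N/2, but never below zero
chi2_yates <- N * max(0, abs(a * d - b * c) - N / 2)^2 / denom

# Compare against chisq.test() with and without the correction
all.equal(chi2_plain, unname(chisq.test(tt, correct = FALSE)$statistic))
all.equal(chi2_yates, unname(chisq.test(tt)$statistic))
```

Both comparisons should come back TRUE. To reproduce the signed score from the question, multiply either value by ifelse(a > (a + b) * (a + c) / N, 1, -1).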