Consider the usual example, which replicates the example from section 13.1 of An Introduction to Information Retrieval:
https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
library("quanteda")  # note: in recent versions, textmodel_nb() lives in the quanteda.textmodels package
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")
trainingset <- dfm(tokens(txt), tolower = FALSE)
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)
tmod1 <- textmodel_nb(trainingset, y = trainingclass, prior = "docfreq")
According to the docs, PcGw is the posterior class probability given the word. How is it computed? I thought what we cared about was the other way around, that is, P(word | class).
> tmod1$PcGw
features
classes Chinese Beijing Shanghai Macao Tokyo Japan
N 0.1473684 0.2058824 0.2058824 0.2058824 0.5090909 0.5090909
Y 0.8526316 0.7941176 0.7941176 0.7941176 0.4909091 0.4909091
Thanks!
The application is clearly explained in the book chapter you cite, but in essence, the difference is that PcGw is the "probability of the class given the word", while PwGc is the "probability of the word given the class". The former is the posterior, and it is what we need for computing the probability of class membership for a group of words using their joint probability (in quanteda, this is what the predict() function applies). The latter is simply the likelihood, which comes from the relative frequencies of the features in each class, smoothed by default by adding one to the counts in each class.
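For instance, here is a minimal sketch (using the objects fitted above, and assuming a recent quanteda.textmodels predict() method for the type argument) of how the held-out document d5 gets scored: predict() combines these word likelihoods with the class priors via the joint probability of d5's words.

# classify the unlabelled document d5 with the fitted model
predict(tmod1, newdata = trainingset[5, ])
# in recent quanteda.textmodels versions, type = "probability" returns the
# posterior class probabilities rather than just the predicted class
predict(tmod1, newdata = trainingset[5, ], type = "probability")

This should reproduce Example 13.1 of the book, where d5 is assigned to class "Y" (China).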
You can verify the likelihood (PwGc) part as follows. First, group the training documents by training class, and then smooth them.
trainingset_bygroup <- dfm_group(trainingset[1:4, ], trainingclass[-5]) %>%
    dfm_smooth(smoothing = 1)
trainingset_bygroup
# Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
# 2 x 6 sparse Matrix of class "dfm"
# features
# docs Chinese Beijing Shanghai Macao Tokyo Japan
# N 2 1 1 1 2 2
# Y 6 2 2 2 1 1
Then you can see that the (smoothed) word likelihoods are the same as PwGc.
trainingset_bygroup / rowSums(trainingset_bygroup)
# Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
# 2 x 6 sparse Matrix of class "dfm"
# features
# docs Chinese Beijing Shanghai Macao Tokyo Japan
# N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
# Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857
tmod1$PwGc
# features
# classes Chinese Beijing Shanghai Macao Tokyo Japan
# N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
# Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857
But you probably care more about P(class|word), since that is what Bayes' formula is all about: it incorporates the prior class probabilities P(c).
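To see exactly where the PcGw values above come from, here is a rough sketch (not the package's internal code) of that computation: weight each class row of PwGc by its prior and renormalise each word's column. With prior = "docfreq", P(N) = 1/4 and P(Y) = 3/4, because one of the four labelled training documents is "N".

prior <- prop.table(table(trainingclass))                                 # docfreq prior: N = 0.25, Y = 0.75
joint <- as.matrix(tmod1$PwGc) * as.numeric(prior[rownames(tmod1$PwGc)])  # P(w|c) * P(c) for each class row
PcGw_manual <- sweep(joint, 2, colSums(joint), "/")                       # normalise over classes for each word
PcGw_manual
# should match tmod1$PcGw shown in the question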