
how is PcGw computed in quanteda's Naive Bayes?


Consider the usual example that replicates Example 13.1 of An Introduction to Information Retrieval (https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf).

txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")

trainingset <- dfm(txt, tolower = FALSE)
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)

tmod1 <- textmodel_nb(trainingset, y = trainingclass, prior = "docfreq")

According to the docs, PcGw is the posterior class probability given the word. How is it computed? I thought what we cared about was the other way around, that is, P(word | class).

> tmod1$PcGw
       features
classes   Chinese   Beijing  Shanghai     Macao     Tokyo     Japan
      N 0.1473684 0.2058824 0.2058824 0.2058824 0.5090909 0.5090909
      Y 0.8526316 0.7941176 0.7941176 0.7941176 0.4909091 0.4909091

Thanks!


Solution

  • The application is clearly explained in the book chapter you cite, but in essence, the difference is that PcGw is the "probability of the class given the word", while PwGc is the "probability of the word given the class". The former is the posterior and is what we need for computing the probability of class membership for a group of words using the joint probability (in quanteda, this is applied by the predict() function). The latter is simply the likelihood, which comes from the relative frequencies of the features in each class, smoothed by default by adding one to the counts for each class.

    You can verify this if you want as follows. First, group the training documents by training class, and then smooth them.

    trainingset_bygroup <- dfm_group(trainingset[1:4, ], trainingclass[-5]) %>%
        dfm_smooth(smoothing = 1)
    trainingset_bygroup
    # Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
    # 2 x 6 sparse Matrix of class "dfm"
    #     features
    # docs Chinese Beijing Shanghai Macao Tokyo Japan
    #    N       2       1        1     1     2     2
    #    Y       6       2        2     2     1     1
    

    Then you can see that the (smoothed) word likelihoods are the same as PwGc.

    trainingset_bygroup / rowSums(trainingset_bygroup)
    # Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
    # 2 x 6 sparse Matrix of class "dfm"
    #     features
    # docs   Chinese   Beijing  Shanghai     Macao      Tokyo      Japan
    #    N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
    #    Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857
    
    tmod1$PwGc
    #        features
    # classes   Chinese   Beijing  Shanghai     Macao      Tokyo      Japan
    #       N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
    #       Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857
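
    (If you prefer a programmatic check to comparing the printouts by eye, something like the following should work, assuming the rows and columns line up as printed above; check.attributes = FALSE simply ignores the differing "docs"/"classes" dimname labels.)

    all.equal(as.matrix(trainingset_bygroup / rowSums(trainingset_bygroup)),
              as.matrix(tmod1$PwGc),
              check.attributes = FALSE)
    # should return TRUE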
    

    But you probably care more about P(class|word), since that is what Bayes' formula is all about, and it incorporates the prior class probabilities P(c).
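
    To see exactly how PcGw is produced, you can reconstruct it from PwGc with Bayes' rule. Here is a minimal sketch, assuming the document-frequency prior used above (one of the four labelled training documents is N and three are Y, so P(N) = 1/4 and P(Y) = 3/4); Pc, joint and PcGw_manual are just illustrative names.

    # Bayes' rule applied word by word (column by column):
    #   P(c|w) = P(w|c) * P(c) / sum over classes c' of P(w|c') * P(c')
    Pc <- c(N = 1/4, Y = 3/4)                             # prior = "docfreq"
    joint <- tmod1$PwGc * Pc[rownames(tmod1$PwGc)]        # multiply each class row by its prior
    PcGw_manual <- sweep(joint, 2, colSums(joint), "/")   # normalise each feature column
    PcGw_manual
    #        features
    # classes   Chinese   Beijing  Shanghai     Macao     Tokyo     Japan
    #       N 0.1473684 0.2058824 0.2058824 0.2058824 0.5090909 0.5090909
    #       Y 0.8526316 0.7941176 0.7941176 0.7941176 0.4909091 0.4909091
    all.equal(PcGw_manual, tmod1$PcGw)                    # should be TRUE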