Search code examples
javaalgorithmnaivebayes

Understanding algorithm - Multinomial Naive Bayes


I have been introduced to the Naive Bayes classification method (Multinomial NB), with reference to how it is described by Michael Sipser in his book "The Theory of Computation".

I am looking at the algorithm described for both training and applying multinomial NB, presented as follows:

enter image description here

However, I'm coming to a loss when interpreting certain aspects of the algorithm. For instance, in TRAINMULTINOMIALNB(C, D) on line 6:

  • What exactly CONCATENATE_TEXT_OF_ALL_DOCS_IN_CLASS(D, C) do?

So far, I understand it as follows. Suppose we have three - 3 - documents in class "movies" and "songs":

MOVIES
    DOC1 = "big fish"
    DOC2 = "big lebowski"
    DOC3 = "mystic river"

SONGS
    DOC1 = "purple rain"
    DOC2 = "crying in the rain"
    DOC3 = "anaconda"    

After applying CONCATENATE_TEXT_OF_ALL_DOCS_IN_CLASS(D, C), would you then be left with, say strings:

String concatenatedMovies = "big fish big lebowski mystic river" 
String concatenatedSongs = "purple rain crying in the rain anaconda" 

Is this right? Any help to understand this is highly appreciated.


Solution

  • In the end, you want to be able to clasify some text based on content. So you want to be able to say if its Songs or Movies, etc.
    In order to do that with Bayes (or other method), you first use your train data to build a model.

    First, by creating priors (docs in class / total docs) on line 5. Than you compute conditional probabilities (probability of word fish given the class MOVIES, probability of word rain given the class SONGS), lines 7-10. You simply divide the occurences of the term with the total number of terms in class (plus some smoothing -> +1). That is why you concatinate - to be able to count all occurences of a term in a class.
    In the end, you plug these values in Bayes formula and can categorize some unknonw document as MOVIES, SONGS, ... More wiki