
What does this arg max notation mean in the scikit-learn docs for Naive Bayes?


I'm referring to the following page on Naive Bayes:

http://scikit-learn.org/stable/modules/naive_bayes.html

Specifically, the equation beginning with y-hat. I think I generally understand the equations before that, but I don't understand the "arg max y" notation on that line. What does it mean?


Solution

  • Whereas the max of a function is the value of its output at the maximum, the argmax of a function is the value of the input, i.e. the "argument", at which that maximum is attained.

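    A quick way to see the difference in Python, using a toy function that peaks at x = 3 (the names xs and f are just for illustration):

        xs = [0, 1, 2, 3, 4]
        f = lambda x: -(x - 3) ** 2   # toy function, largest at x = 3

        print(max(f(x) for x in xs))  # max    -> 0, the largest output value
        print(max(xs, key=f))         # argmax -> 3, the input that achieves it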

    In the equation in your example:

    $$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

    y_hat (ŷ) is the value of y, i.e. the class label, that maximizes the right-hand expression.

    Here P(y) is typically estimated as the proportion of class y in the training set, also called the "prior", and P(x_i | y) is the probability of observing feature value x_i given that the true class is y, also called the "likelihood".
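    As a minimal sketch of that decision rule in plain Python (assuming priors maps each class label y to P(y) and likelihood[y][x_i] gives P(x_i | y); both names are made up for illustration):

        import math

        def predict(x, priors, likelihood):
            """Return y_hat = argmax_y P(y) * prod_i P(x_i | y)."""
            def score(y):
                return priors[y] * math.prod(likelihood[y][x_i] for x_i in x)
            return max(priors, key=score)  # argmax over the class labels y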

    To understand the product over P(x_i | y) better, consider an example where you are trying to classify a sequence of coin flips as coming from either coin A, which lands heads 50% of the time in the training data, or coin B, which lands heads 66.7% of the time. Here each individual P(x_i | y_j) is the probability that coin y_j (where y_j is either A or B) lands x_i (where x_i is either heads or tails).

    Training set:
    
    THH    A
    HTT    A
    HTH    A
    TTH    A
    HHH    B
    HTH    B
    TTH    B
    
    Test set:
    
    HHT    ?
    

    So the sequence HHT has a 0.667 * 0.667 * 0.333 = 0.148 likelihood given coin B, but only a 0.5 * 0.5 * 0.5 = 0.125 likelihood given coin A. However, we estimate a 4/7 ≈ 57% prior for coin A, since A produced 4 of the 7 training sequences, so we end up predicting coin A: 0.57 * 0.125 ≈ 0.071 beats 0.43 * 0.148 ≈ 0.064. Intuitively, a sequence is a priori more likely to have come from coin A, and that prior outweighs coin A's slightly lower likelihood for this particular sequence.

    If the priors for coins A and B were 50% each, we would instead predict coin B for HHT, since this sequence has the higher likelihood under coin B.
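
    To make the arithmetic concrete, here is a small self-contained Python sketch that estimates the priors and per-flip likelihoods from the training set above and scores HHT under each coin (all names are illustrative):

        from collections import Counter

        train = [("THH", "A"), ("HTT", "A"), ("HTH", "A"), ("TTH", "A"),
                 ("HHH", "B"), ("HTH", "B"), ("TTH", "B")]

        # Prior P(y): fraction of training sequences produced by each coin
        priors = {y: sum(1 for _, label in train if label == y) / len(train)
                  for y in ("A", "B")}

        # Likelihood P(x_i | y): per-flip heads/tails rates, pooled per coin
        likelihood = {}
        for y in ("A", "B"):
            flips = "".join(seq for seq, label in train if label == y)
            counts = Counter(flips)
            likelihood[y] = {f: counts[f] / len(flips) for f in "HT"}

        def score(seq, y):
            """P(y) times the product of per-flip likelihoods."""
            p = priors[y]
            for flip in seq:
                p *= likelihood[y][flip]
            return p

        print({y: round(score("HHT", y), 4) for y in ("A", "B")})
        # {'A': 0.0714, 'B': 0.0635} -> argmax is coin A, as argued above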