I'm referring to the following page on Naive Bayes:
http://scikit-learn.org/stable/modules/naive_bayes.html
Specifically, the equation beginning with y-hat. I think I generally understand the equations before that, but I don't understand the "arg max y" notation on that line. What does it mean?
Whereas the $\max$ of a function is the value of its output at the maximum, the $\arg\max$ of a function is the value of its input, i.e. the "argument", at which that maximum is attained.
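As a quick illustration (a minimal NumPy sketch, not from the linked page):

    import numpy as np

    x = np.array([1, 3, 2])
    print(np.max(x))     # 3 -- the largest output value
    print(np.argmax(x))  # 1 -- the index (the "argument") at which that maximum occurs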
In the equation in your example,
$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),$$
$\hat{y}$ is the value of $y$, i.e. the class label, that maximizes the right-hand expression. Here $P(y)$ is typically the proportion of class $y$ in the training set, also called the "prior", and $P(x_i \mid y)$ is the probability of observing the feature value $x_i$ if the true class is indeed $y$, also called the "likelihood".
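In code, the decision rule just evaluates $P(y) \prod_i P(x_i \mid y)$ for every candidate class and keeps the best one. Here is a minimal sketch, not scikit-learn's actual implementation; the `priors` and `likelihoods` dictionaries are hypothetical placeholders:

    def predict(x, priors, likelihoods):
        # priors: {class: P(y)}
        # likelihoods: {class: {feature value: P(x_i | y)}}
        best_class, best_score = None, -1.0
        for y, prior in priors.items():
            score = prior
            for x_i in x:
                score *= likelihoods[y][x_i]  # multiply in each P(x_i | y)
            if score > best_score:            # keep the arg max over y
                best_class, best_score = y, score
        return best_class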
To understand the product $\prod_i P(x_i \mid y)$ better, consider an example where you are trying to classify a sequence of coin flips as coming from either coin $A$, which lands heads in 50% of its flips in the training set, or coin $B$, which lands heads in 66.7% of its flips in the training set. Here each individual $P(x_i \mid y_j)$ is the probability of coin $y_j$ (where $j$ is either $a$ or $b$) landing $x_i$ (where $x_i$ is either heads or tails).
Training set:
THH A
HTT A
HTH A
TTH A
HHH B
HTH B
TTH B
Test set:
HHT ?
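The 50% and 66.7% head rates come from simply counting flips per coin in the training set, e.g. (assuming the data is stored as (sequence, label) pairs):

    train = [("THH", "A"), ("HTT", "A"), ("HTH", "A"), ("TTH", "A"),
             ("HHH", "B"), ("HTH", "B"), ("TTH", "B")]

    for coin in ("A", "B"):
        flips = "".join(seq for seq, label in train if label == coin)
        print(coin, flips.count("H") / len(flips))  # A: 6/12 = 0.5, B: 6/9 = 0.667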
So the sequence HHT has a $0.667 \times 0.667 \times 0.333 = 0.148$ likelihood given coin $B$, but only a $0.5 \times 0.5 \times 0.5 = 0.125$ likelihood given coin $A$. However, we estimate a 57% prior for coin $A$, since $A$ appears in $4/7$ training examples, so we would end up predicting that this sequence came from coin $A$, since $0.57 \times 0.125 > 0.43 \times 0.148$. This is because we are more likely to start with coin $A$, so coin $A$ has more chances to produce some less-likely sequences.
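Restating that arithmetic in code (a hypothetical `score` helper, just prior times likelihood):

    def score(seq, p_heads, prior):
        # prior * P(seq | coin), where the coin lands heads with probability p_heads
        likelihood = 1.0
        for flip in seq:
            likelihood *= p_heads if flip == "H" else 1 - p_heads
        return prior * likelihood

    print(score("HHT", 1/2, 4/7))  # coin A: ~0.0714
    print(score("HHT", 2/3, 3/7))  # coin B: ~0.0635 -> predict A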
If the prior for coins $A$ and $B$ were 50% each, then we would naturally predict coin $B$ for HHT, since this sequence has the higher likelihood given coin $B$.
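With the same hypothetical `score` helper as above:

    print(score("HHT", 1/2, 0.5))  # coin A: ~0.0625
    print(score("HHT", 2/3, 0.5))  # coin B: ~0.0741 -> predict B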