Tags: python, machine-learning, neural-network, classification, theano

log-likelihood cost function: mean or sum?


In this code for computing the negative log-likelihood they say:

Note: we use the mean instead of the sum so that the learning rate is less dependent on the batch size

and this is how they get the negative log-likelihood:

return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
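
For context, the quoted line uses advanced indexing to pick out, for each example in the batch, the log-probability that the model assigns to the correct class. A rough NumPy equivalent (my own illustration, assuming p_y_given_x is an (N, K) matrix of class probabilities and y is a length-N vector of integer labels) would be:

    import numpy as np

    def negative_log_likelihood(p_y_given_x, y):
        # Select, for each of the N examples, the log-probability assigned
        # to its true class, then average over the batch and negate.
        correct_log_probs = np.log(p_y_given_x)[np.arange(y.shape[0]), y]
        return -np.mean(correct_log_probs)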

However, in many textbooks (e.g. Pattern Recognition and Machine Learning by Bishop), the negative log-likelihood is computed as the sum of the individual sample errors rather than the mean. I still don't understand the author's note. Should we always use the mean rather than the sum when computing the cost function, even when we are not using mini-batches?


Solution

  • The difference between the mean and the sum is just the multiplication by 1/N.

    The problem with using the sum is that the batch size (N) will influence your gradients. The learning rate indicates how much in the direction of the gradient you want to adjust your parameters.

    If the gradient grows with the batch size, then you will need to re-tune the learning rate every time you change N.

    In practice, in order to keep these two (learning rate and batch size) independent, it is common to use the mean instead of the sum. This makes the gradient magnitude independent of N.

    If you are not using a batch, then N = 1 and the mean is the same as the sum. A small numerical sketch of this scaling behaviour follows below.
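
To make the scaling concrete, here is a minimal NumPy sketch (my own illustration, not part of the original answer) using a hypothetical linear softmax classifier on random data. Doubling the batch by duplicating it doubles the norm of the summed gradient, while the mean-reduced gradient is unchanged:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def nll_grad(W, X, y, reduction):
        # Gradient of the negative log-likelihood w.r.t. W for a linear
        # softmax classifier: X.T @ (p - onehot(y)), optionally divided by N.
        p = softmax(X @ W)                      # (N, K) class probabilities
        p[np.arange(len(y)), y] -= 1.0          # p - onehot(y)
        grad = X.T @ p                          # (D, K)
        return grad / len(y) if reduction == "mean" else grad

    D, K = 5, 3
    W = rng.normal(size=(D, K))
    X = rng.normal(size=(32, D))                # batch of N = 32 samples
    y = rng.integers(0, K, size=32)

    # Duplicate the batch: same data, but N doubles to 64.
    X2, y2 = np.vstack([X, X]), np.concatenate([y, y])

    for reduction in ("sum", "mean"):
        g_small = np.linalg.norm(nll_grad(W, X, y, reduction))
        g_large = np.linalg.norm(nll_grad(W, X2, y2, reduction))
        print(f"{reduction:>4}: ||grad|| at N=32 -> {g_small:.3f}, N=64 -> {g_large:.3f}")
    # "sum" doubles when N doubles; "mean" stays the same, so the same
    # learning rate keeps working as the batch size changes.

This is why the tutorial divides by N: the gradient descent step then depends only on the learning rate, not on how many samples happen to be in the batch.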