As the title says, I'm wondering whether each mini-batch normalization happens based only on that mini-batch's own statistics, or whether it uses moving averages/statistics across mini-batches (during training)?
Also, is there a way to force batch normalization to use moving averages/statistics across batches?
The motivation is that because of memory limitations, my batch size is quite small.
Thanks in advance.
During training, each mini-batch normalization happens based only on that mini-batch's own statistics.
Regarding using moving averages/statistics across batches: batch renormalization is an interesting approach for applying batch normalization to small batch sizes. The basic idea behind batch renormalization comes from the fact that we do not use the individual mini-batch statistics for batch normalization during inference. Instead, we use a moving average of the mini-batch statistics, because a moving average provides a better estimate of the true mean and variance than individual mini-batches do.
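To make the training/inference distinction concrete, here is a minimal sketch of a batch normalization forward pass in plain PyTorch, assuming a 2-D input of shape (batch, features); the function and argument names are illustrative, not any library's actual API:

```python
import torch

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       training, momentum=0.1, eps=1e-5):
    """Illustrative batch norm forward pass (not PyTorch's internal code)."""
    if training:
        # Training: normalize with this mini-batch's own statistics...
        batch_mean = x.mean(dim=0)
        batch_var = x.var(dim=0, unbiased=False)
        x_hat = (x - batch_mean) / torch.sqrt(batch_var + eps)
        # ...and update the moving averages, which are only used at inference.
        with torch.no_grad():
            running_mean.mul_(1 - momentum).add_(momentum * batch_mean)
            running_var.mul_(1 - momentum).add_(momentum * batch_var)
    else:
        # Inference: normalize with the moving averages accumulated during training.
        x_hat = (x - running_mean) / torch.sqrt(running_var + eps)
    return gamma * x_hat + beta
```

(In PyTorch itself, a BatchNorm layer switched to eval mode behaves like the `else` branch, i.e. it normalizes with the running statistics; `momentum` and `track_running_stats` control how those statistics are maintained.)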
Then why don't we use the moving average during training? The answer has to do with the fact that during training we need to perform backpropagation. In essence, when we use some statistics to normalize the data, we need to backpropagate through those statistics as well. If we use the statistics of activations from previous mini-batches to normalize the data, we need to account for how the previous layer affected those statistics during backpropagation. If we ignore these interactions, we could cause previous layers to keep increasing the magnitude of their activations even though doing so has no effect on the loss. Accounting for them properly, however, would require storing the data from all previous mini-batches during training, which is far too expensive.
In batch renormalization, the authors propose to use a moving average while also taking the effect of previous layers on the statistics into account. Their method is, at its core, a simple reparameterization of normalization with the moving average. If we denote the moving-average mean and standard deviation as mu and sigma, and the mini-batch mean and standard deviation as mu_B and sigma_B, the batch renormalization equation is:

x_hat = ((x - mu_B) / sigma_B) * r + d,   where r = sigma_B / sigma and d = (mu_B - mu) / sigma.
In other words, we multiply the batch-normalized activations by r and add d, where both r and d are computed from the mini-batch statistics and the moving-average statistics. The trick here is not to backpropagate through r and d. Although this means we ignore some of the effects of previous layers on previous mini-batches, since the mini-batch statistics and the moving-average statistics should be the same on average, the overall effect should cancel out on average as well.
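Here is a minimal sketch of that reparameterization (training-mode forward only, same assumptions and made-up names as the sketch above; r_max and d_max are the clipping bounds from the batch renormalization paper, with illustrative values):

```python
import torch

def batch_renorm_forward(x, gamma, beta, running_mean, running_std,
                         r_max=3.0, d_max=5.0, momentum=0.01, eps=1e-5):
    """Illustrative batch renormalization forward pass (training mode)."""
    batch_mean = x.mean(dim=0)
    batch_std = torch.sqrt(x.var(dim=0, unbiased=False) + eps)

    # r and d relate the mini-batch statistics to the moving averages.
    # They are treated as constants: no gradient flows through them (detach),
    # and they are clipped to keep training stable.
    r = (batch_std / running_std).detach().clamp(1.0 / r_max, r_max)
    d = ((batch_mean - running_mean) / running_std).detach().clamp(-d_max, d_max)

    # Normalize with the mini-batch statistics, then correct with r and d;
    # ignoring the clipping, this equals normalizing with the moving averages.
    x_hat = (x - batch_mean) / batch_std * r + d

    # Update the moving averages for use at inference (and in future r, d).
    with torch.no_grad():
        running_mean.mul_(1 - momentum).add_(momentum * batch_mean)
        running_std.mul_(1 - momentum).add_(momentum * batch_std)

    return gamma * x_hat + beta
```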
Unfortunately, batch renormalization's performance still degrades as the batch size decreases (though not as badly as batch normalization's), meaning group normalization still has a slight advantage in the small-batch-size regime.