Tags: deep-learning, batch-normalization

Batch Normalization makes Training Faster?


I read everywhere that, in addition to improving accuracy, "Batch Normalization makes Training Faster".

I am probably misunderstanding something (because BN has been proven effective more than once), but it seems kind of illogical to me.

Indeed, adding BN to a network increases the number of parameters to learn: BN introduces trainable "scale" and "offset" parameters. See: https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization
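
For example, here is a minimal sketch of where those extra parameters show up, assuming TensorFlow 2.x and the Keras BatchNormalization layer (rather than the low-level tf.nn op); the layer sizes are arbitrary, chosen only for illustration:

    # Sketch (TensorFlow 2.x): counting the extra trainable parameters
    # that a BatchNormalization layer adds after a Dense layer.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(64,)),
        tf.keras.layers.Dense(128),            # 64*128 + 128 = 8320 trainable weights
        tf.keras.layers.BatchNormalization(),  # adds gamma (scale) and beta (offset): 2*128 trainable
        tf.keras.layers.ReLU(),
    ])

    # BatchNormalization also tracks non-trainable moving mean/variance (2*128),
    # so the layer reports 512 parameters in total, 256 of them trainable.
    model.summary()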

How can the network train faster while having "more work to do"?

(I hope my question is legitimate or at least not too stupid).

Thank you :)


Solution

  • Batch normalization accelerates training by requiring fewer iterations to converge to a given loss value. This is usually achieved by using higher learning rates, but even with smaller learning rates you can still see an improvement. The paper shows this pretty clearly (see the sketch after this answer for a toy comparison).

    Using ReLU also has this effect when compared to a sigmoid activation, as shown in the original AlexNet paper (which did not use BN).

    Batch normalization also makes the optimization problem "easier": reducing the internal covariate shift avoids many of the plateaus where the loss stagnates or decreases only slowly. Plateaus can still occur, but they are much less frequent.
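
As a rough illustration of the "fewer iterations at a higher learning rate" point, here is a hedged sketch assuming TensorFlow 2.x and a toy MNIST MLP; the learning rate, layer sizes, and epoch count are arbitrary illustrative choices, not values from the paper:

    # Sketch: train the same MLP with and without BatchNormalization at a
    # deliberately high learning rate and compare how fast the loss drops.
    import tensorflow as tf

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

    def build_model(use_bn):
        layers = [tf.keras.layers.Input(shape=(784,))]
        for _ in range(3):
            layers.append(tf.keras.layers.Dense(256))
            if use_bn:
                layers.append(tf.keras.layers.BatchNormalization())
            layers.append(tf.keras.layers.ReLU())
        layers.append(tf.keras.layers.Dense(10, activation="softmax"))
        return tf.keras.Sequential(layers)

    for use_bn in (False, True):
        model = build_model(use_bn)
        model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=0.5),  # intentionally high
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"],
        )
        history = model.fit(x_train, y_train, epochs=3, batch_size=128, verbose=0)
        print(f"use_bn={use_bn}: loss per epoch = {history.history['loss']}")

Typically the BN variant reaches a lower loss in fewer epochs at this learning rate, while the plain MLP needs a smaller learning rate (and therefore more iterations) to train stably; exact numbers depend on initialization and hardware.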