Tags: tensorflow, keras, normalization, batch-normalization

BatchNormalization layer in Keras gives unexpected output values


Given the input values [1, 5], normalizing them should yield something like [-1, 1] if I understand correctly, because

mean = 3
var = 4
result = (x - mean) / sqrt(var)

However, this minimal example

import numpy as np

import keras
from keras.models import Model
from keras.layers import Input, BatchNormalization
from keras import backend as K

shape = (1, 2, 1)
inp = Input(shape=shape)  # avoid shadowing the built-in `input`
x = BatchNormalization(center=False)(inp)  # no beta
model = Model(inputs=inp, outputs=x)
model.compile(loss='mse', optimizer='sgd')

# training with dummy data
training_in = [np.random.random(size=(10, *shape))]
training_out = [np.random.random(size=(10, *shape))]
model.fit(training_in, training_out, epochs=10)

data_in = np.array([[[[1], [5]]]], dtype=np.float32)
data_out = model.predict(data_in)

print('gamma   :', K.eval(model.layers[1].gamma))
#print('beta    :', K.eval(model.layers[1].beta))
print('moving_mean:', K.eval(model.layers[1].moving_mean))
print('moving_variance:', K.eval(model.layers[1].moving_variance))

print('epsilon :', model.layers[1].epsilon)
print('data_in :', data_in)
print('data_out:', data_out)

produces the following output:

gamma   : [ 0.80644524]
moving_mean: [ 0.05885344]
moving_variance: [ 0.91000736]
epsilon : 0.001
data_in : [[[[ 1.]
   [ 5.]]]]
data_out: [[[[ 0.79519051]
   [ 4.17485714]]]]

So it is [0.79519051, 4.17485714] instead of [-1, 1].

I had a look at the source, and the values seem to be forwarded to tf.nn.batch_normalization. From that, the result should be what I expect, but obviously it is not.

So how are the output values calculated?


Solution

  • If you're using gamma, the right equation is actually result = gamma * (x - mean) / sqrt(var + epsilon) for batch normalization, BUT the mean and var that are used differ between training and inference:

    • During training (fit), they are mean_batch and var_batch, calculated from the input values of the batch (they are simply the mean and variance of your batch), just as you're doing. Meanwhile, a global moving_mean and moving_variance are learnt this way: moving_mean = alpha * moving_mean + (1 - alpha) * mean_batch, where alpha is a kind of learning rate in (0, 1), usually above 0.9. moving_mean and moving_variance are approximations of the real mean and variance of all your training data. Gamma is also learnt, by the usual gradient descent, to best fit your output. (There is a sketch of this update after this list.)

    • During inference (predict), you just use the learnt values of moving_mean and moving_variance, not mean_batch and var_batch at all. You also use the learnt gamma.

    So 0.05885344 is just an approximation of the mean of your random input data, and 0.91000736 of its variance, and you're using these to normalize your new data [1, 5]. You can easily check that [0.79519051, 4.17485714] = gamma * ([1, 5] - moving_mean) / sqrt(moving_variance + epsilon); note that epsilon must be included to reproduce the output exactly (see the verification snippet below).

    Edit: alpha is called momentum in Keras, if you want to check it.
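To make the training-time update concrete, here is a minimal NumPy sketch, assuming momentum = 0.99 (the Keras default for BatchNormalization) and a single channel, so the statistics reduce to plain means and variances; the variable names are illustrative, not the layer's internals:

import numpy as np

momentum = 0.99                          # the `alpha` above; Keras default
moving_mean, moving_variance = 0.0, 1.0  # Keras initial values

batch = np.random.random(size=(10, 1, 2, 1))
batch_mean = batch.mean()
batch_var = batch.var()

# During fit, the batch is normalized with its *own* statistics...
normalized = (batch - batch_mean) / np.sqrt(batch_var + 1e-3)

# ...while the running statistics drift toward those of the whole dataset.
moving_mean = momentum * moving_mean + (1 - momentum) * batch_mean
moving_variance = momentum * moving_variance + (1 - momentum) * batch_var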
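You can also verify the inference-time formula directly from the values printed above:

import numpy as np

gamma = 0.80644524
moving_mean = 0.05885344
moving_variance = 0.91000736
epsilon = 0.001

data_in = np.array([1.0, 5.0])
data_out = gamma * (data_in - moving_mean) / np.sqrt(moving_variance + epsilon)
print(data_out)  # [ 0.79519051  4.17485714] -- matches model.predict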