tensorflow, machine-learning, keras, theano, lasagne

Convert Lasagne BatchNormLayer to Keras BatchNormalization layer


I want to convert a pretrained Lasagne (Theano) model to a Keras (TensorFlow) model, so all layers need to have the exact same configuration. It is not clear to me from the two documentations how the parameters correspond. Let's assume a Lasagne BatchNormLayer with default settings:

class lasagne.layers.BatchNormLayer(incoming, axes='auto', epsilon=1e-4, alpha=0.1, beta=lasagne.init.Constant(0), gamma=lasagne.init.Constant(1), mean=lasagne.init.Constant(0), inv_std=lasagne.init.Constant(1), **kwargs)

And this is the Keras BatchNormalization layer API:

keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)

Most of it is clear, so I'll provide the corresponding parameters for future reference here:

(Lasagne -> Keras)
incoming -> (not needed, automatic)
axes -> axis
epsilon -> epsilon
alpha -> ?
beta -> beta_initializer
gamma -> gamma_initializer
mean -> moving_mean_initializer
inv_std -> moving_variance_initializer
? -> momentum
? -> center
? -> scale
? -> beta_regularizer
? -> gamma_regularizer
? -> beta_constraint
? -> gamma_constraint

I assume Lasagne simply does not support beta_regularizer, gamma_regularizer, beta_constraint and gamma_constraint, so the Keras default of None is correct. I also assume that in Lasagne, center and scale are always enabled and cannot be turned off.

That leaves Lasagne alpha and Keras momentum. From the Lasagne documentation for alpha:

Coefficient for the exponential moving average of batch-wise means and standard deviations computed during training; the closer to one, the more it will depend on the last batches seen

From the Keras documentation for momentum:

Momentum for the moving mean and the moving variance.

They seem to correspond -- but by which formula?


Solution

  • From the Lasagne code we see the usage of alpha like so:

    running_mean.default_update = ((1 - self.alpha) * running_mean +
                                   self.alpha * input_mean)
    running_inv_std.default_update = ((1 - self.alpha) *
                                      running_inv_std +
                                      self.alpha * input_inv_std)
    

    and from this issue discussing Keras batch norm 'momentum' we can see:

    def assign_moving_average(variable, value, decay, zero_debias=True, name=None):
        """Compute the moving average of a variable.
        The moving average of 'variable' updated with 'value' is:
          variable * decay + value * (1 - decay)
    
        ...
    

    where, as the issue notes, the TensorFlow term 'decay' is what takes on the value of 'momentum' from Keras.

    From this, it appears that what Lasagne calls 'alpha' is equal to 1 - 'momentum': in Keras, 'momentum' is the multiplier of the existing moving average, while in Lasagne that multiplier is 1 - alpha.

    Admittedly it is confusing because

    • the TensorFlow operation underneath Keras uses the term 'decay', but this is what Keras directly names 'momentum'.
    • the TensorFlow code merely names things 'variable' and 'value', which makes it hard to tell which one is the stored moving average and which one is the new batch statistic being mixed in.
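To convince yourself of the mapping, here is a minimal pure-Python sanity check (no Keras or Lasagne required; the variable names are mine, chosen to mirror the snippets above) that the two update rules coincide when momentum = 1 - alpha:

```python
alpha = 0.1            # Lasagne default
momentum = 1 - alpha   # hypothesised Keras equivalent, i.e. 0.9

running_mean = 5.0     # the stored moving average
input_mean = 7.0       # statistic computed from the current batch

# Lasagne rule: running = (1 - alpha) * running + alpha * new
lasagne_update = (1 - alpha) * running_mean + alpha * input_mean

# Keras/TensorFlow rule: variable = variable * decay + value * (1 - decay),
# where 'decay' takes the value of Keras 'momentum'
keras_update = running_mean * momentum + input_mean * (1 - momentum)

assert abs(lasagne_update - keras_update) < 1e-12
print(lasagne_update)  # 5.2 for both rules
```

So, assuming this reading is right, a Lasagne BatchNormLayer with the default alpha=0.1 would correspond to a Keras BatchNormalization layer with momentum=0.9 (and epsilon=1e-4, to match the Lasagne default rather than the Keras one).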