Tags: neural-network, conv-neural-network, batch-normalization, activation-function, relu

Assuming the order Conv2d->ReLU->BN, should the Conv2d layer have a bias parameter?


Should we include the bias parameter in Conv2d if we go for Conv2d followed by ReLU followed by batch norm (BN)?

There is no need for it if we go for Conv2d followed by BN followed by ReLU, since BN's shift parameter takes over the job of the bias.
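
For concreteness, the two orderings under discussion could be set up like this in PyTorch (just a minimal sketch; the channel counts and kernel size are arbitrary and only for illustration):

    import torch.nn as nn

    # Order asked about: Conv2d -> ReLU -> BatchNorm (bias kept, since that is the question)
    conv_relu_bn = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=True),
        nn.ReLU(),
        nn.BatchNorm2d(16),
    )

    # Alternative order: Conv2d -> BatchNorm -> ReLU (bias disabled, BN's shift absorbs it)
    conv_bn_relu = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(16),
        nn.ReLU(),
    )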


Solution

  • Yes, if the order is Conv2d -> ReLU -> BatchNorm, then having a bias parameter in the convolution can help. To show that, let's assume there is a bias in the convolution layer and compare what happens with both of the orders you mention in the question; the idea is to see whether the bias is useful in each case.

    Let's consider a single pixel in one of the convolution's output channels, and assume that x_1, ..., x_k are the corresponding inputs (in vectorised form) from the batch (batch size == k). We can write the convolution's output at that pixel as

    Wx + b   # with W the convolution weights, b the bias
    

    As you said in the question, when the order is Conv2d -> BN -> ReLU, the bias is not useful: all it does to the distribution of the Wx_i is shift it by b, and that shift is immediately cancelled out by the BN layer (with mu and sigma the batch mean and std of the Wx_i):

    (Wx_i + b - (mu + b))/sigma == (Wx_i - mu)/sigma   # i.e. no change
    

    However, if you use the other order, i.e.

    BN(ReLU(Wx+b))
    

    then ReLU will map some of the Wx_i + b to 0 (say k - m of them, with 0 < m < k). As a consequence, the mean computed by BN will look like this:

    (1/k)(0 + ... + 0 + SUM_s (Wx_s + b)) = some_term + (m/k)*b   # s runs over the m surviving outputs
    

    and the variance (hence the std) will look like

    const * ((0 - some_term - (m/k)*b)^2 + ... + (Wx_i + b - some_term - (m/k)*b)^2 + ...)
    

    and, as you can see by expanding the terms that depend on a non-zero Wx_i + b:

    (Wx_i + b - some_term - (m/k)*b)^2 = some_other_term + some_factor * b * Wx_i
    

    which means that the result depends on b in a multiplicative manner. As a consequence, its effect can't simply be reproduced by the shift parameter of the BN layer (denoted beta in most implementations and papers). That is why having a bias term is not useless when you use this order. A quick numerical sanity check of both cases is sketched below.
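
Below is a minimal numerical sanity check of the argument above (not part of the original answer), using plain tensors to stand in for the conv outputs Wx_i at a single pixel; the batch size, values and bias are arbitrary, and batchnorm is a hand-rolled stand-in for a BN layer with gamma = 1 and beta = 0:

    import torch

    torch.manual_seed(0)

    k = 8                        # batch size
    Wx = torch.randn(k)          # pre-bias conv outputs Wx_i for one output pixel
    b = 3.0                      # some bias value, compared against having no bias

    def batchnorm(z, eps=1e-5):
        # plain batch normalisation over the batch dimension (gamma = 1, beta = 0)
        return (z - z.mean()) / (z.std() + eps)

    # Order Conv2d -> BN -> ReLU: the bias is cancelled by the normalisation
    no_bias   = torch.relu(batchnorm(Wx))
    with_bias = torch.relu(batchnorm(Wx + b))
    print(torch.allclose(no_bias, with_bias, atol=1e-5))    # True: b has no effect

    # Order Conv2d -> ReLU -> BN: the bias changes the result, and not by a
    # constant shift that BN's beta parameter could absorb
    no_bias   = batchnorm(torch.relu(Wx))
    with_bias = batchnorm(torch.relu(Wx + b))
    print(torch.allclose(no_bias, with_bias, atol=1e-5))    # False for a generic draw
    print((with_bias - no_bias).std())                      # typically non-zero: not a constant shift

The first check prints True because the batch mean absorbs the constant shift b before ReLU is applied; the second typically prints False because ReLU zeroes a different fraction of the batch depending on b, so the normalised values themselves change.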