Should we include the bias parameter in Conv2d if we are going for Conv2d followed by ReLU followed by batch norm (BN)?
There is no need if we go for Conv2d followed by BN followed by ReLU, since the shift parameter of BN takes care of the bias's job.
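For concreteness, the two variants I have in mind look roughly like this in PyTorch (channel counts and kernel size are just placeholders):

```python
import torch.nn as nn

# Variant 1: Conv2d -> BatchNorm -> ReLU.
# The convolution bias is redundant here, since BatchNorm's shift (beta) can absorb it.
conv_bn_relu = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Variant 2: Conv2d -> ReLU -> BatchNorm.
# Should bias=True be kept here?
conv_relu_bn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=True),
    nn.ReLU(),
    nn.BatchNorm2d(16),
)
```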
Yes: if the order is conv2d -> ReLU -> BatchNorm, then having a bias parameter in the convolution can help. To show that, let's assume there is a bias in the convolution layer and compare what happens with both of the orders you mention in the question; the idea is to see whether the bias is useful in each case.
Let's consider a single pixel from one of the convolution's output channels, and assume that x_1, ..., x_k are the corresponding inputs (in vectorised form) from the batch (batch size = k). We can write the convolution at that pixel as

Wx + b    # with W the convolution weights and b the bias
As you said in the question, when the order is conv2d -> BN -> ReLU, the bias is not useful, because all it does to the distribution of the Wx_i is shift it by b, and this shift is cancelled out by the BN layer that immediately follows (the batch mean shifts by b as well):

(Wx_i - mu)/sigma  ==>  (Wx_i + b - (mu + b))/sigma = (Wx_i - mu)/sigma,   i.e. no change.
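As a quick sanity check, here is a toy numerical sketch of that cancellation (made-up numbers, and a bare batch normalisation without the learnable gamma/beta):

```python
import torch

torch.manual_seed(0)
k = 8
wx = torch.randn(k)   # the values Wx_i for one output pixel, across a batch of size k
b = 2.5               # a constant bias added to every Wx_i

def batchnorm(z, eps=1e-5):
    # normalise by the batch mean and std (no learnable gamma/beta)
    return (z - z.mean()) / torch.sqrt(z.var(unbiased=False) + eps)

# Adding b shifts the batch mean by exactly b, so it cancels in the normalised output.
print(torch.allclose(batchnorm(wx), batchnorm(wx + b)))   # True
```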
However, if you use the other order, i.e.

BN(ReLU(Wx + b))

then ReLU will map some of the Wx_i + b to 0. Suppose m of the k values stay positive. The batch mean then looks like

(1/k)(0 + ... + 0 + SUM_s (Wx_s + b)) = some_term + m*b/k

where the sum runs over the m indices s with Wx_s + b > 0, and the variance looks like

const * ((0 - some_term - m*b/k)^2 + ... + (Wx_i + b - some_term - m*b/k)^2 + ...)

As you can see by expanding the terms that contain a non-zero Wx_i + b,

(Wx_i + b - some_term - m*b/k)^2 = some_other_term + some_factor * b * Wx_i

the result depends on b in a multiplicative manner: b is coupled to the Wx_i through cross terms rather than entering as a pure offset. As a consequence, its absence cannot simply be compensated by the shift component of the BN layer (denoted beta in most implementations and papers). That is why a bias term is not useless when this order is used.
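The same point can be seen numerically with the toy set-up from above (made-up numbers again): if b merely shifted the distribution, the per-sample difference between the two normalised outputs below would be a constant that beta could absorb, but because of the ReLU it is not.

```python
import torch

torch.manual_seed(0)
wx = torch.randn(8)   # Wx_i for one output pixel, across a batch of 8
b = 2.5

def batchnorm(z, eps=1e-5):
    return (z - z.mean()) / torch.sqrt(z.var(unbiased=False) + eps)

with_bias    = batchnorm(torch.relu(wx + b))
without_bias = batchnorm(torch.relu(wx))

# The differences are not all equal across the batch, so no single shift (BN's beta)
# applied to the bias-free version can reproduce the effect of b.
print(with_bias - without_bias)
```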